Results 1 - 20 of 52
1.
Article in English | MEDLINE | ID: mdl-38194384

ABSTRACT

Unsupervised anomaly detection (UAD) attracts considerable research interest and drives widespread applications, in which only anomaly-free samples are available for training. Some UAD applications further intend to localize the anomalous regions without any anomaly information. Although the absence of anomalous samples and annotations deteriorates UAD performance, an inconspicuous yet powerful statistical model, the normalizing flow, is well suited to anomaly detection (AD) and localization in an unsupervised fashion. Flow-based probabilistic models, trained only on anomaly-free data, can efficiently distinguish unpredictable anomalies by assigning them much lower likelihoods than normal data. Nevertheless, the size variation of unpredictable anomalies poses another challenge to flow-based methods for high-precision AD and localization. To generalize across anomaly size variation, we propose a novel multiscale flow-based framework (MSFlow) composed of asymmetrical parallel flows followed by a fusion flow that exchanges multiscale perceptions. Moreover, different multiscale aggregation strategies are adopted for image-wise AD and pixel-wise anomaly localization according to the discrepancy between them. The proposed MSFlow is evaluated on three AD datasets, significantly outperforming existing methods. Notably, on the challenging MVTec AD benchmark, our MSFlow achieves a new state-of-the-art (SOTA) with a detection AUROC of up to 99.7%, a localization AUROC of 98.8%, and a PRO score of 97.1%.
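The scoring principle described above, assigning low likelihood to unpredictable anomalies, can be illustrated without a deep flow. The sketch below substitutes a diagonal Gaussian for the learned flow density and scores samples by negative log-likelihood; all names and numbers are illustrative and are not MSFlow's implementation.

```python
import numpy as np

def fit_density(train_normal):
    # Fit a diagonal Gaussian to anomaly-free features
    # (a stand-in for a learned normalizing-flow density).
    mu = train_normal.mean(axis=0)
    var = train_normal.var(axis=0) + 1e-6
    return mu, var

def anomaly_score(x, mu, var):
    # Negative log-likelihood under the fitted density: unpredictable
    # anomalies receive much lower likelihood, hence a higher score.
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)

rng = np.random.default_rng(0)
normal_train = rng.normal(0.0, 1.0, size=(500, 8))       # anomaly-free only
mu, var = fit_density(normal_train)
inlier_score = anomaly_score(rng.normal(0.0, 1.0, size=8), mu, var)
outlier_score = anomaly_score(np.full(8, 6.0), mu, var)  # far from normal data
```

A threshold on this score then yields the detection decision; in flow-based methods the Gaussian is replaced by an invertible network evaluated by the change-of-variables formula.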

2.
IEEE Trans Image Process ; 32: 5909-5920, 2023.
Article in English | MEDLINE | ID: mdl-37883290

ABSTRACT

The optical flow guidance strategy is ideal for obtaining motion information of objects in video and is widely utilized in video segmentation tasks. However, existing optical flow-based methods depend heavily on optical flow, which results in poor performance when the optical flow estimation fails for a particular scene. The temporal consistency provided by optical flow can be effectively supplemented by modeling it in a structural form. This paper proposes a new hierarchical graph neural network (GNN) architecture, dubbed hierarchical graph pattern understanding (HGPU), for zero-shot video object segmentation (ZS-VOS). Inspired by the strong ability of GNNs to capture structural relations, HGPU innovatively leverages motion cues (i.e., optical flow) to enhance the high-order representations from the neighbors of target frames. Specifically, a hierarchical graph pattern encoder with message aggregation is introduced to acquire different levels of motion and appearance features in a sequential manner. Furthermore, a decoder is designed to hierarchically parse and understand the transformed multi-modal contexts for more accurate and robust results. HGPU achieves state-of-the-art performance on four publicly available benchmarks (DAVIS-16, YouTube-Objects, Long-Videos and DAVIS-17). The code and pre-trained model can be found at https://github.com/NUST-Machine-Intelligence-Laboratory/HGPU.

3.
Article in English | MEDLINE | ID: mdl-37028297

ABSTRACT

Embodied question answering (EQA) is a recently emerged research field in which an agent answers a user's questions by exploring the environment and collecting visual information. Many researchers have turned their attention to EQA because of its broad potential applications, such as in-home robots, self-driving vehicles, and personal assistants. High-level visual tasks such as EQA are susceptible to noisy inputs because they involve complex reasoning processes. Before the benefits of EQA can be realized in practical applications, models must be equipped with good robustness against label noise. To tackle this problem, we propose a novel label-noise-robust learning algorithm for the EQA task. First, a joint-training co-regularization method is proposed for noise filtering in the visual question answering (VQA) module, which trains two parallel network branches with a single loss function. Then, a two-stage hierarchical robust learning algorithm is proposed to filter out noisy navigation labels at both the trajectory level and the action level. Finally, taking the purified labels as inputs, a joint robust learning mechanism coordinates the work of the whole EQA system. Empirical results demonstrate that, in both extremely noisy environments (45% noisy labels) and mildly noisy environments (20% noisy labels), deep learning models trained with our algorithm are more robust than existing EQA models.
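The filtering idea behind training two parallel branches, trusting only samples that both branches find easy, can be sketched in a few lines. This is a generic small-loss filtering illustration with made-up loss values, not the paper's co-regularization loss; `keep_ratio` and the sample losses are hypothetical.

```python
import numpy as np

def co_filter(losses_a, losses_b, keep_ratio=0.8):
    # Keep only the samples that BOTH branches consider low-loss;
    # high-loss samples are treated as likely label noise.
    k = int(len(losses_a) * keep_ratio)
    small_a = set(np.argsort(losses_a)[:k])
    small_b = set(np.argsort(losses_b)[:k])
    return sorted(small_a & small_b)

# Toy per-sample losses: sample 4 has a corrupted label,
# so both branches assign it a large loss.
losses_a = np.array([0.10, 0.20, 0.15, 0.30, 2.5])
losses_b = np.array([0.12, 0.25, 0.10, 0.28, 3.0])
kept = co_filter(losses_a, losses_b, keep_ratio=0.8)
```

Only the intersection of the two branches' small-loss sets survives, so a sample must look clean to both networks to be used for the next update.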

4.
IEEE Trans Image Process ; 32: 2348-2359, 2023.
Article in English | MEDLINE | ID: mdl-37074884

ABSTRACT

Zero-shot video object segmentation (ZS-VOS) aims to segment foreground objects in a video sequence without prior knowledge of these objects. However, existing ZS-VOS methods often struggle to distinguish between foreground and background or to keep track of the foreground in complex scenarios. The common practice of introducing motion information, such as optical flow, can lead to overreliance on optical flow estimation. To address these challenges, we propose an encoder-decoder-based hierarchical co-attention propagation network (HCPN) capable of tracking and segmenting objects. Specifically, our model is built upon multiple collaborative evolutions of the parallel co-attention module (PCM) and the cross co-attention module (CCM). PCM captures common foreground regions among adjacent appearance and motion features, while CCM further exploits and fuses cross-modal motion features returned by PCM. Our method is progressively trained to achieve hierarchical spatio-temporal feature propagation across the entire video. Experimental results demonstrate that our HCPN outperforms all previous methods on public benchmarks, showcasing its effectiveness for ZS-VOS. Code and pre-trained model can be found at https://github.com/NUST-Machine-Intelligence-Laboratory/HCPN.

5.
Article in English | MEDLINE | ID: mdl-36264718

ABSTRACT

Temporal language grounding (TLG) is one of the most challenging cross-modal video understanding tasks; it aims at retrieving the video segment most relevant to a natural language sentence from an untrimmed video. Existing methods can be separated into two dominant types: 1) proposal-based and 2) proposal-free methods, where the former conducts contextual interactions and the latter localizes timestamps flexibly. However, the constant-scale candidates in proposal-based methods limit localization precision and bring extra computational costs. In contrast, proposal-free methods perform well on high-precision metrics thanks to fine-grained features but suffer from a lack of coarse-grained interactions, which causes degradation when the video becomes complex. In this article, we propose a novel framework termed semantic decoupling network (SDN) that combines the advantages of proposal-based and proposal-free methods and overcomes their defects. It contains three key components: 1) a semantic decoupling module (SDM); 2) a context modeling block (CMB); and 3) a semantic cross-level aggregation module (SCAM). By capturing video-text contexts at multiple semantic levels, the SDM and CMB effectively exploit the benefits of proposal-based methods. Meanwhile, the SCAM maintains the merit of proposal-free methods in that it localizes timestamps precisely. Experiments on three challenging datasets, i.e., Charades-STA, TACoS, and ActivityNet-Caption, show that our proposed SDN significantly outperforms recent state-of-the-art methods, especially proposal-free ones. Extensive analyses, as well as the implementation code of the proposed SDN method, are provided at https://github.com/CFM-MSG/Code_SDN.
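TLG predictions such as those above are commonly judged by the temporal intersection-over-union between the retrieved segment and the annotated one (metrics of the "R@1, IoU >= m" style). A minimal sketch of that overlap computation, with illustrative timestamps:

```python
def temporal_iou(pred, gt):
    # pred, gt: (start, end) timestamps of a video segment, in seconds.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Predicted segment overlaps the annotation by 5 s out of an 8 s union.
iou = temporal_iou((4.0, 10.0), (5.0, 12.0))  # 5 / 8 = 0.625
```

A retrieval is counted correct at threshold m when this value exceeds m, which is why constant-scale proposals cap the achievable precision at tight thresholds.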

6.
IEEE Trans Neural Netw Learn Syst ; 33(7): 3050-3064, 2022 Jul.
Article in English | MEDLINE | ID: mdl-33646956

ABSTRACT

MixUp is an effective data augmentation method that regularizes deep neural networks via random linear interpolations between pairs of samples and their labels. It plays an important role in model regularization, semisupervised learning (SSL), and domain adaptation. However, despite its empirical success, the deficiency of randomly mixing samples has been poorly studied. Since deep networks are capable of memorizing the entire dataset, the corrupted samples generated by vanilla MixUp with a badly chosen interpolation policy will degrade the performance of networks. To overcome overfitting to corrupted samples, inspired by metalearning (learning to learn), we propose a novel technique of learning to mix up, namely MetaMixUp. Unlike vanilla MixUp, which samples the interpolation policy from a predefined distribution, this article introduces a metalearning-based online optimization approach to dynamically learn the interpolation policy in a data-adaptive way (learning to learn better). The validation set performance obtained via metalearning captures the degree of noise, which provides optimal directions for interpolation policy learning. Furthermore, we adapt our method for pseudolabel-based SSL along with a refined pseudolabeling strategy. In our experiments, our method achieves better performance than vanilla MixUp and its variants under the supervised learning (SL) configuration. In particular, extensive experiments show that our MetaMixUp-adapted SSL greatly outperforms MixUp and many state-of-the-art methods on the CIFAR-10 and SVHN benchmarks under the SSL configuration.
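The vanilla MixUp baseline the abstract improves on is only a few lines: draw an interpolation coefficient from a Beta prior and linearly mix two samples and their labels. MetaMixUp's contribution is to replace this fixed prior with a meta-learned, data-adaptive policy; the sketch below shows only the vanilla operation, and `alpha` is an illustrative hyperparameter.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    # Vanilla MixUp: lambda ~ Beta(alpha, alpha), then linear
    # interpolation of both inputs and one-hot labels.
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2, lam

rng = np.random.default_rng(0)
x1, x2 = np.ones(4), np.zeros(4)
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix, lam = mixup(x1, y1, x2, y2, alpha=0.4, rng=rng)
```

A badly chosen policy (e.g., an `alpha` that concentrates lambda near 0.5 for dissimilar pairs) is exactly the source of the corrupted samples the paper's meta-learned policy avoids.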

7.
IEEE Trans Pattern Anal Mach Intell ; 44(9): 4524-4543, 2022 Sep.
Article in English | MEDLINE | ID: mdl-33798072

ABSTRACT

Binary optimization problems (BOPs) arise naturally in many fields, such as information retrieval, computer vision, and machine learning. Most existing binary optimization methods either use continuous relaxation, which can cause large quantization errors, or incorporate a highly specific algorithm that can only be used for particular loss functions. To overcome these difficulties, we propose a novel generalized optimization method, named Alternating Binary Matrix Optimization (ABMO), for solving BOPs. ABMO can handle BOPs with or without orthogonality or linear constraints for a large class of loss functions. ABMO rewrites the binary, orthogonality, and linear constraints of BOPs as an intersection of two closed sets, then iteratively divides the original problem into several small optimization problems that can be solved in closed form. To provide a strict theoretical convergence analysis, we add a sufficiently small perturbation and translate the original problem into an approximated problem whose feasible set is continuous. We not only provide rigorous mathematical proof of convergence to a stationary and feasible point but also derive the convergence rate of the proposed algorithm. The promising results obtained on four binary optimization tasks validate the superiority and generality of ABMO compared with state-of-the-art methods.
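The "split into small closed-form subproblems" pattern can be illustrated on the simplest BOP instance: fitting data with one binary factor. The binary subproblem max over B of the inner product with M, subject to B in {-1, +1}, has the closed-form solution B = sign(M). The toy alternating scheme below illustrates only that splitting pattern and is not ABMO itself; the factorization model and all sizes are assumptions for the sketch.

```python
import numpy as np

def project_binary(M):
    # Closed-form solution of  max_B <B, M>  s.t.  B in {-1, +1}^(n x k):
    # pick the sign of each entry independently.
    B = np.sign(M)
    B[B == 0] = 1.0
    return B

def alternating_binary_fit(X, k, iters=20, seed=0):
    # Toy alternation: fit X ~ B @ W with binary B by alternating a
    # closed-form binary step and a least-squares continuous step.
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(k, X.shape[1]))
    B = project_binary(X @ np.linalg.pinv(W))
    for _ in range(iters):
        B = project_binary(X @ np.linalg.pinv(W))  # binary subproblem
        W = np.linalg.pinv(B) @ X                  # continuous subproblem
    return B, W

rng = np.random.default_rng(1)
B_true = np.sign(rng.normal(size=(50, 2)))
X = B_true @ rng.normal(size=(2, 4))
B, W = alternating_binary_fit(X, k=2)
```

Each subproblem is exactly solvable, which is the property ABMO generalizes to constrained BOPs with a perturbation-based convergence guarantee.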

8.
Molecules ; 26(18)2021 Sep 13.
Article in English | MEDLINE | ID: mdl-34577023

ABSTRACT

A simple and rapid method for the efficient synthesis of sulfonyl chlorides/bromides from sulfonyl hydrazides with NXS (X = Cl or Br), together with their late-stage conversion to several other functional groups, is described. A variety of nucleophiles can be engaged in this transformation, permitting the synthesis of complex sulfonamides and sulfonates. In most cases, these reactions are highly selective, simple, and clean, affording products in excellent yields.

9.
IEEE Trans Image Process ; 30: 7776-7789, 2021.
Article in English | MEDLINE | ID: mdl-34495830

ABSTRACT

Person re-identification (ReID) aims to retrieve the pedestrian with the same identity across different views. Existing studies mainly focus on improving accuracy while ignoring efficiency. Recently, several hash-based methods have been proposed. Despite their improvement in efficiency, there still exists an unacceptable gap in accuracy between these methods and real-valued ones. Besides, few attempts have been made to simultaneously and explicitly reduce redundancy and improve the discrimination of hash codes, especially short ones. Integrating mutual learning may be a possible solution to reach this goal. However, it fails to utilize the complementary effect of teacher and student models, and it degrades the performance of the teacher model by treating the two models equally. To address these issues, we propose salience-guided iterative asymmetric mutual hashing (SIAMH) to achieve high-quality hash code generation and fast feature extraction. Specifically, a salience-guided self-distillation branch (SSB) is proposed to enable SIAMH to generate hash codes based on salience regions, thus explicitly reducing the redundancy between codes. Moreover, a novel iterative asymmetric mutual training strategy (IAMT) is proposed to alleviate the drawbacks of common mutual learning; it continuously refines the discriminative regions for the SSB and extracts regularized dark knowledge for the two models as well. Extensive experimental results on five widely used datasets demonstrate the superiority of the proposed method in efficiency and accuracy compared with existing state-of-the-art hashing and real-valued approaches. The code is released at https://github.com/Vill-Lab/SIAMH.
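The efficiency argument for hashing-based ReID rests on comparing compact binary codes with XOR-and-popcount operations instead of floating-point distances. A minimal retrieval sketch over bit-packed codes (the 8-bit toy codes are made up for illustration and are unrelated to SIAMH's learned codes):

```python
import numpy as np

def pack_codes(bits):
    # bits: (n, k) array of {0, 1}; pack 8 bits per byte for compact storage.
    return np.packbits(bits.astype(np.uint8), axis=1)

def hamming_rank(query, db):
    # Rank database codes by Hamming distance to the query:
    # XOR the packed bytes, then count the differing bits.
    dists = np.unpackbits(np.bitwise_xor(query, db), axis=1).sum(axis=1)
    return np.argsort(dists, kind="stable"), dists

db_bits = np.array([[1, 0, 1, 1, 0, 0, 1, 0],
                    [1, 0, 1, 1, 0, 1, 1, 0],   # 1 bit away from row 0
                    [0, 1, 0, 0, 1, 1, 0, 1]])  # complement of row 0
db = pack_codes(db_bits)
query = pack_codes(db_bits[:1])
order, dists = hamming_rank(query, db)
```

Because the gallery stays in packed form, memory drops by roughly 32x versus float features and distance evaluation reduces to integer bit operations.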


Subject(s)
Algorithms , Pedestrians , Humans
10.
IEEE Trans Neural Netw Learn Syst ; 32(10): 4514-4528, 2021 Oct.
Article in English | MEDLINE | ID: mdl-32903190

ABSTRACT

Semantic-preserving hashing establishes efficient multimedia retrieval by transferring knowledge from original data to hash codes so that the latter can preserve the underlying visual and semantic similarities. However, a crucial bottleneck remains: how to effectively bridge the trilateral domain gaps (i.e., the visual, semantic, and hashing spaces) to further improve retrieval accuracy. In this article, we propose an inductive structure consistent hashing (ISCH) method, which interactively coordinates the semantic correlations between the visual feature space, the binary class space, and the discrete hashing space. Specifically, an inductive semantic space is formulated by a simple multilayer stacking class-encoder, which transforms naive class information into flexible semantic embeddings. Meanwhile, we design a semantic dictionary learning model to facilitate the bilateral visual-semantic bridging and guide the class-encoder toward reliable semantics, which alleviates the visual-semantic bias problem. In particular, the visual descriptors and the respective semantic class representations are regularized with a coinciding alignment module. To generate privileged hash codes, we further explore semantic and prototype binary code learning to jointly quantify the semantic and latent visual representations into unified discrete hash codes. Moreover, an efficient optimization algorithm is developed to address the resulting discrete programming problem. Comprehensive experiments conducted on four large-scale datasets, i.e., CIFAR-10, NUS-WIDE, ImageNet, and MSCOCO, demonstrate the superiority of our method over state-of-the-art alternatives under different evaluation protocols.

11.
Article in English | MEDLINE | ID: mdl-32286981

ABSTRACT

Although great success has been achieved in activity analysis, many challenges remain. Most existing works on activity recognition focus on designing efficient architectures or video sampling strategies. However, due to the fine-grained nature of actions and the long-term structure of video, activity recognition requires reasoning about the temporal relations between video sequences. In this paper, we propose an efficient temporal reasoning graph (TRG) to simultaneously capture the appearance features and the temporal relations between video sequences at multiple time scales. Specifically, we construct learnable temporal relation graphs to explore temporal relations over multi-scale ranges. Additionally, to facilitate multi-scale temporal relation extraction, we design a multi-head temporal adjacency matrix to represent multiple kinds of temporal relations. Finally, a multi-head temporal relation aggregator is proposed to extract the semantic meaning of the features convolved through the graphs. Extensive experiments are performed on widely used large-scale datasets, such as Something-Something, Charades and Jester, and the results show that our model achieves state-of-the-art performance. Further analysis shows that temporal relation reasoning with our TRG can extract discriminative features for activity recognition.

12.
IEEE Trans Neural Netw Learn Syst ; 31(12): 5412-5425, 2020 Dec.
Article in English | MEDLINE | ID: mdl-32071004

ABSTRACT

The task of image-text matching refers to measuring the visual-semantic similarity between an image and a sentence. Recently, fine-grained matching methods that explore the local alignment between image regions and sentence words have shown advances in inferring the image-text correspondence by aggregating pairwise region-word similarity. However, the local alignment is hard to achieve, as some important image regions may be inaccurately detected or even missing. Meanwhile, some words with high-level semantics cannot strictly correspond to a single image region. To tackle these problems, we address the importance of exploiting the global semantic consistency between image regions and sentence words as a complement to the local alignment. In this article, we propose a novel hybrid matching approach named Cross-modal Attention with Semantic Consistency (CASC) for image-text matching. The proposed CASC is a joint framework that performs cross-modal attention for local alignment and multilabel prediction for global semantic consistency. It directly extracts semantic labels from an available sentence corpus without additional labor cost, which further provides a global similarity constraint for the aggregated region-word similarity obtained by the local alignment. Extensive experiments on the Flickr30k and Microsoft COCO (MSCOCO) datasets demonstrate the effectiveness of the proposed CASC in preserving global semantic consistency along with the local alignment, and further show its superior image-text matching performance compared with more than 15 state-of-the-art methods.
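The local-alignment aggregation that such methods build on, scoring every region-word pair, aligning each word to its best region, then pooling, can be sketched as follows. This is a generic max-over-regions formulation with made-up features, not CASC's attention weighting, which additionally enforces a global multilabel constraint.

```python
import numpy as np

def local_alignment_score(regions, words):
    # regions: (R, d) image-region features; words: (W, d) word features.
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = w @ r.T                   # (W, R) pairwise region-word cosine similarity
    return sim.max(axis=1).mean()   # best-matching region per word, averaged

# Words that each exactly match one region score 1.0;
# opposed features score lower.
aligned = local_alignment_score(np.eye(3), np.eye(3)[:2])
mismatched = local_alignment_score(np.eye(3), -np.eye(3)[:2])
```

The failure mode the abstract identifies is visible here: a word whose true region was never detected still gets forced onto its best remaining region, which a global consistency constraint can correct.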

13.
IEEE Trans Neural Netw Learn Syst ; 31(7): 2348-2360, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32012029

ABSTRACT

Studies show that dividing categories into subcategories contributes to better image classification. Existing image subcategorization works rely on expert knowledge and labeled images, which is both time-consuming and labor-intensive. In this article, we propose to select and subsequently classify images into categories and subcategories. Specifically, we first obtain a list of candidate subcategory labels from untagged corpora. Then, we purify these subcategory labels by calculating their relevance to the target category. To suppress search errors and the outlier images induced by noisy subcategory labels, we formulate outlier-image removal and optimal classification model learning as a unified problem to jointly learn multiple classifiers, where the classifier for a category is obtained by combining multiple subcategory classifiers. Compared with existing subcategorization works, our approach eliminates the dependence on expert knowledge and labeled images. Extensive experiments on image categorization and subcategorization demonstrate the superiority of our proposed approach.

14.
IEEE Trans Cybern ; 50(4): 1460-1472, 2020 Apr.
Article in English | MEDLINE | ID: mdl-30571653

ABSTRACT

Recently, graph-based hashing that learns similarity-preserving binary codes via an affinity graph has been extensively studied for large-scale image retrieval. However, most graph-based hashing methods resort to intractable binary quadratic programs, making them unscalable to massive data. In this paper, we propose a novel graph convolutional network-based hashing framework, dubbed GCNH, which directly carries out spectral convolution operations on both an image set and an affinity graph built over the set, naturally yielding similarity-preserving binary embedding. GCNH fundamentally differs from conventional graph hashing methods which adopt an affinity graph as the only learning guidance in an objective function to pursue the binary embedding. As the core ingredient of GCNH, we introduce an intuitive asymmetric graph convolutional (AGC) layer to simultaneously convolve the anchor graph, input data, and convolutional filters. By virtue of the AGC layer, GCNH well addresses the issues of scalability and out-of-sample extension when leveraging affinity graphs for hashing. As a use case of our GCNH, we particularly study the semisupervised hashing scenario in this paper. Comprehensive image retrieval evaluations on the CIFAR-10, NUS-WIDE, and ImageNet datasets demonstrate the consistent advantages of GCNH over the state-of-the-art methods given limited labeled data.

15.
IEEE Trans Cybern ; 50(9): 4157-4168, 2020 Sep.
Article in English | MEDLINE | ID: mdl-31603830

ABSTRACT

Unsupervised image hashing has recently gained significant momentum due to the scarcity of reliable supervision knowledge, such as class labels and pairwise relationships. Previous unsupervised methods rely heavily on constructing a sufficiently large affinity matrix to explore the geometric structure of the data. Nevertheless, because the intrinsic information of the original visual data is not adequately preserved, satisfactory performance can hardly be achieved. In this article, we propose a novel approach, called bidirectional discrete matrix factorization hashing (BDMFH), which alternates two mutually promoted processes: 1) learning binary codes from data and 2) recovering data from the binary codes. In particular, we design an inverse factorization model that enforces the learned binary codes to inherit the intrinsic structure of the original visual data. Moreover, we develop an efficient discrete optimization algorithm for the proposed BDMFH. Comprehensive experimental results on three large-scale benchmark datasets show that the proposed BDMFH not only significantly outperforms the state-of-the-art methods but also provides satisfactory computational efficiency.

16.
Article in English | MEDLINE | ID: mdl-31725381

ABSTRACT

Human actions span a wide variety and a large number of categories, which poses a big challenge for action recognition. However, according to similarities in human body poses, scenes, and interactive objects, human actions can be grouped into semantic groups, e.g., sports and cooking. Therefore, in this paper, we propose a novel approach that recognizes human actions from coarse to fine. Taking full advantage of high-level semantic contexts, a context-knowledge-map-guided recognition method is designed to realize the coarse-to-fine procedure. In this approach, we define semantic contexts with the interactive objects, scenes, and body motions in action videos, and build a context knowledge map to automatically define coarse-grained groups. Fine-grained classifiers are then proposed to realize accurate action recognition. The coarse-to-fine procedure narrows the action categories handled by the target classifiers, which benefits recognition performance. We evaluate the proposed approach on the CCV, HMDB-51, and UCF-101 databases. Experiments verify its effectiveness, improving recognition precision by more than 5% on average over current approaches. Compared with the state-of-the-art, it also obtains outstanding performance, achieving accuracies of 93.1%, 95.4%, and 74.5% on the CCV, UCF-101, and HMDB-51 databases, respectively.

17.
Article in English | MEDLINE | ID: mdl-30794175

ABSTRACT

Zero-shot learning aims to classify visual instances from unseen classes in the absence of training examples. This is typically achieved by directly mapping visual features to a semantic embedding space of classes (e.g., attributes or word vectors), where the similarity between the two modalities can be readily measured. However, the semantic space may not be reliable for recognition due to noisy class embeddings or the visual bias problem. In this work, we propose a novel Binary embedding based Zero-Shot Learning (BZSL) method, which recognizes visual instances from unseen classes through an intermediate discriminative Hamming space. Specifically, BZSL jointly learns two binary coding functions to encode both visual instances and class embeddings into the Hamming space, which alleviates the visual-semantic bias problem. As a desirable property, classifying an unseen instance can thereby be done efficiently by retrieving its nearest class codes with minimal Hamming distance. During training, by introducing two auxiliary variables for the coding functions, we formulate an equivalent correlation maximization problem, which admits an analytical solution. The resulting algorithm thus enjoys both highly efficient training and scalable inference for novel classes. Extensive experiments on four benchmark datasets, including the full ImageNet Fall 2011 dataset with over 20K unseen classes, demonstrate the superiority of our method on the zero-shot learning task. In particular, we show that increasing the binary embedding dimension consistently improves recognition accuracy.
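The inference step highlighted above, classifying an unseen instance by retrieving the class code nearest in Hamming distance, is easy to sketch. Below, a shared random sign projection stands in for BZSL's two learned coding functions; the dimensions, seed, and class embeddings are all illustrative.

```python
import numpy as np

def binary_encode(x, P):
    # Sign of a projection; a random P stands in for the learned
    # visual/semantic coding functions of the paper.
    return (np.atleast_2d(x) @ P > 0).astype(np.uint8)

def classify(x_code, class_codes):
    # Nearest class code in Hamming distance.
    dists = (x_code != class_codes).sum(axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(1)
P = rng.normal(size=(16, 32))                 # shared projection into 32-bit codes
class_embeddings = rng.normal(size=(5, 16))   # e.g., attribute vectors per class
class_codes = binary_encode(class_embeddings, P)
pred = classify(binary_encode(class_embeddings[3], P), class_codes)
```

Since class codes are precomputed, adding a novel class at test time only appends one row, which is the scalability property the abstract claims.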

18.
IEEE Trans Cybern ; 49(7): 2631-2641, 2019 Jul.
Article in English | MEDLINE | ID: mdl-29993730

ABSTRACT

Video captioning has been attracting broad research attention in the multimedia community. However, most existing approaches heavily rely on static visual information or only partially capture local temporal knowledge (e.g., within 16 frames), and thus hardly describe motions accurately from a global view. In this paper, we propose a novel video captioning framework that integrates bidirectional long short-term memory (BiLSTM) and a soft attention mechanism to generate better global representations for videos and to enhance the recognition of lasting motions. To generate video captions, we exploit another long short-term memory network as a decoder to fully explore global contextual information. The benefits of our proposed method are twofold: 1) the BiLSTM structure comprehensively preserves global temporal and visual information and 2) the soft attention mechanism enables the language decoder to recognize and focus on principal targets within the complex content. We verify the effectiveness of our proposed video captioning framework on two widely used benchmarks, that is, the Microsoft Video Description corpus and MSR-Video to Text, and the experimental results demonstrate the superiority of the proposed approach over several state-of-the-art methods.

19.
IEEE Trans Cybern ; 49(3): 781-791, 2019 Mar.
Article in English | MEDLINE | ID: mdl-29993970

ABSTRACT

Most learning-based hashing algorithms leverage sample-to-sample similarities, such as neighborhood structure, to generate binary codes, which achieve promising results for image retrieval. This type of method is referred to as instance-level encoding. However, it is nontrivial to define a scalar that represents sample-to-sample similarity while encoding both the semantic labels and the data structure. To address this issue, in this paper we use a class-level encoding method, which encodes the class-to-class relationship, to take the semantic information of classes into consideration. Based on these two encodings, we propose a novel framework, error-correcting input and output (EC-IO) coding, which performs class-level and instance-level encoding in a unified mapping space. Our proposed model contains two major components: distribution preservation and error correction. With these two components, our model maps the input features of samples and the output codes of classes into a unified space to simultaneously encode the intrinsic structure of the data and the semantic information of classes. Under this framework, we present our hashing model, EC-IO hashing (EC-IOH), by approximating the mapping space with the Hamming space. Extensive experiments are conducted to evaluate the retrieval performance, and EC-IOH exhibits superior and competitive performance compared with popular supervised and unsupervised hashing methods.

20.
IEEE Trans Pattern Anal Mach Intell ; 41(7): 1774-1782, 2019 Jul.
Article in English | MEDLINE | ID: mdl-29994652

ABSTRACT

Clustering is a long-standing and important research problem; however, it remains challenging when handling large-scale image data from diverse sources. In this paper, we present a novel Binary Multi-View Clustering (BMVC) framework, which can dexterously manipulate multi-view image data and easily scale to large data. To achieve this goal, we formulate BMVC with two key components in a joint learning framework: compact collaborative discrete representation learning and binary clustering structure learning. Specifically, BMVC collaboratively encodes the multi-view image descriptors into a compact common binary code space by considering their complementary information; the collaborative binary representations are meanwhile clustered by a binary matrix factorization model, such that the cluster structures are optimized in the Hamming space by pure, extremely fast bit-operations. For efficiency, code balance constraints are imposed on both the binary data representations and the cluster centroids. Finally, the resulting optimization problem is solved by an alternating optimization scheme with guaranteed fast convergence. Extensive experiments on four large-scale multi-view image datasets demonstrate that the proposed method enjoys a significant reduction in both computation and memory footprint while achieving superior (in most cases) or very competitive performance in comparison with state-of-the-art clustering methods.
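The idea of optimizing cluster structure in the Hamming space with bit operations can be illustrated with a toy binary k-means: assign codes to the nearest binary centroid by Hamming distance and update each centroid by a per-bit majority vote. This is a didactic stand-in, not BMVC's binary matrix factorization; the 8-bit codes and cluster count are made up.

```python
import numpy as np

def binary_kmeans(codes, k, iters=10, seed=0):
    # codes: (n, m) array of {0, 1}. Assignment uses Hamming distance;
    # the centroid update is a per-bit majority vote over cluster members.
    rng = np.random.default_rng(seed)
    centroids = codes[rng.choice(len(codes), size=k, replace=False)]
    for _ in range(iters):
        dists = (codes[:, None, :] != centroids[None, :, :]).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = codes[assign == j]
            if len(members):
                centroids[j] = (members.mean(axis=0) >= 0.5).astype(codes.dtype)
    return assign, centroids

# Two well-separated groups of 8-bit codes.
codes = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
                  [1, 1, 1, 0, 0, 0, 0, 0],
                  [1, 1, 1, 1, 1, 0, 0, 0],
                  [0, 0, 0, 0, 1, 1, 1, 1],
                  [0, 0, 0, 1, 1, 1, 1, 1],
                  [0, 0, 0, 0, 1, 1, 1, 0]], dtype=np.uint8)
assign, centroids = binary_kmeans(codes, k=2)
```

Both the distance and the centroid update stay in the binary domain, which is what makes Hamming-space clustering cheap in computation and memory compared with real-valued k-means.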
