1.
Article in English | MEDLINE | ID: mdl-38530719

ABSTRACT

Weakly supervised temporal action localization (TAL) aims to localize action instances in untrimmed videos using only video-level action labels. Without snippet-level labels, it is hard to assign every snippet an accurate action/background category. The main difficulties are the large variations brought by the unconstrained background snippets and by the multiple subactions within action snippets. The existing prototype model focuses on describing snippets by covering them with clusters (defined as prototypes). In this work, we argue that a clustered prototype covering snippets with simple variations still suffers from misclassification of the snippets with large variations. We propose an ensemble prototype network (EPNet), which ensembles prototypes learned with consensus-aware clustering. The network stacks a consensus prototype learning (CPL) module and an ensemble snippet weight learning (ESWL) module as one stage and extends one stage to multiple stages in an ensemble learning manner. The CPL module learns a consensus matrix by estimating the similarity of clustering labels between two successive clustering generations. The consensus matrix optimizes the clustering to learn consensus prototypes, which can predict the snippets with consensus labels. The ESWL module estimates the weights of the misclassified snippets using the snippet-level loss. The weights update the posterior probabilities of the snippets in the clustering to learn the prototypes of the next stage. We use multiple stages to learn multiple prototypes, which can cover the snippets with large variations for accurate snippet classification. Extensive experiments show that our method outperforms state-of-the-art weakly supervised TAL methods on three benchmark datasets, that is, the THUMOS'14, ActivityNet v1.2, and ActivityNet v1.3 datasets.
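For intuition only, the NumPy/scikit-learn sketch below shows one plausible way to form a consensus matrix from two successive clustering generations, scoring snippet pairs that are grouped consistently across generations; the function name and details are our own illustration, not the authors' released code.

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(labels_prev, labels_curr):
    """Pairwise agreement between two clustering generations:
    entry (i, j) is 1 if snippets i and j are grouped consistently
    (together in both generations or apart in both), else 0."""
    same_prev = labels_prev[:, None] == labels_prev[None, :]
    same_curr = labels_curr[:, None] == labels_curr[None, :]
    return (same_prev == same_curr).astype(np.float32)

# Toy usage: cluster snippet features twice and measure consensus.
feats = np.random.randn(200, 64).astype(np.float32)          # snippet features
labels_prev = KMeans(n_clusters=8, n_init=10).fit_predict(feats)
labels_curr = KMeans(n_clusters=8, n_init=10).fit_predict(feats + 0.01)
C = consensus_matrix(labels_prev, labels_curr)                # (200, 200)
print("fraction of consistent pairs:", C.mean())
```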

2.
IEEE Trans Image Process ; 33: 1574-1587, 2024.
Article in English | MEDLINE | ID: mdl-38335089

ABSTRACT

Group activity recognition aims to identify a consistent group activity from the different actions performed by individuals. Most existing methods focus on learning the interaction between each pair of individuals (i.e., second-order interaction). In this work, we argue that second-order interactive relations are insufficient for this task. We propose a third-order active factor graph network, which models the third-order interaction within each triple of active individuals. First, to alleviate the effect of noisy individual actions, we select active individuals by measuring each individual's influence; the individuals with the top-k largest influence weights are selected as active individuals. Then, for each triple of individuals, we build a new factor node and connect the factor node with these individual nodes. In other words, we extend the base second-order interactive graph to a new third-order interactive graph, which we define as a factor graph. Next, we design a two-branch factor graph network, in which one branch considers all individuals (denoted as the full factor graph) and the other takes the active individuals into consideration (denoted as the active factor graph). We leverage both the active and full factor graphs comprehensively for group activity recognition. Besides, to enforce group consistency, a consistency-aware reasoning module is designed with two penalty terms, which describe the inconsistency between individual actions and the group activity. Extensive experiments demonstrate that our method achieves state-of-the-art performance on four benchmark datasets, i.e., the Volleyball, Collective Activity, Collective Activity Extended, and SoccerNet-v3 datasets. Visualization results further validate the interpretability of our method.
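A minimal PyTorch sketch of the active-individual selection and triple enumeration described above follows; the helper name, the top-k value, and the mean initialization of factor-node features are illustrative assumptions rather than the authors' implementation.

```python
import itertools
import torch

def active_triples(individual_feats, influence_scores, k=6):
    """Select the top-k most influential individuals and enumerate all
    triples among them; each triple would become one factor node whose
    initial embedding is (for illustration) the mean of its members."""
    topk = torch.topk(influence_scores, k=k).indices           # (k,)
    triples = list(itertools.combinations(topk.tolist(), 3))   # C(k, 3) triples
    factor_feats = torch.stack(
        [individual_feats[list(t)].mean(dim=0) for t in triples])
    return triples, factor_feats

# Toy usage: 12 individuals with 128-d features and scalar influence weights.
feats = torch.randn(12, 128)
influence = torch.rand(12)
triples, factor_feats = active_triples(feats, influence, k=6)
print(len(triples), factor_feats.shape)   # 20 triples -> torch.Size([20, 128])
```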

3.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 13265-13280, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37402185

ABSTRACT

We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module that encodes video by explicitly capturing the visual objects, their relations, and their dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of a multi-modal transformer for answer classification; fine-grained video-text communication is done by additional cross-modal interaction modules. 3) It is optimized by joint fully- and self-supervised contrastive objectives between the correct and incorrect answers, as well as between the relevant and irrelevant questions, respectively. With superior video encoding and a contrastive QA solution, we show that CoVGT achieves much better performance than previous methods on video reasoning tasks. Its performance even surpasses that of models pretrained on millions of external data samples. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude less data. The results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining.
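As a rough illustration of contrastive QA (not the authors' code), a cross-entropy over video-answer similarities can be written as below; the names and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def answer_contrastive_loss(video_emb, answer_embs, correct_idx, tau=0.07):
    """Cross-entropy over cosine similarities: pull the video representation
    toward the correct answer embedding and away from the incorrect ones."""
    v = F.normalize(video_emb, dim=-1)             # (D,)
    a = F.normalize(answer_embs, dim=-1)           # (num_answers, D)
    logits = (a @ v) / tau                         # (num_answers,)
    target = torch.tensor(correct_idx)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

# Toy usage: one video embedding against 5 candidate answers, answer 2 correct.
loss = answer_contrastive_loss(torch.randn(256), torch.randn(5, 256), correct_idx=2)
print(loss.item())
```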

4.
IEEE Trans Pattern Anal Mach Intell ; 45(10): 11824-11841, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37167050

ABSTRACT

In real-world applications, it is often the case that data come with multiple views. Fully exploiting the information in each view is important for making the data more representative. However, due to various limitations and failures in data collection and pre-processing, real data inevitably suffer from missing views and data scarcity. The coexistence of these two issues makes the pattern classification task even more challenging. Currently, to the best of our knowledge, few methods can handle these two issues well simultaneously. Aiming to draw more attention from the community to this challenge, we propose a new task in this paper, called few-shot partial multi-view learning, which focuses on overcoming the negative impact of the view-missing issue in the low-data regime. The challenges of this task are twofold: (i) it is difficult to overcome the impact of data scarcity under the interference of missing views; (ii) the limited amount of data exacerbates information scarcity, which in turn makes it harder to address the view-missing issue. To address these challenges, we propose a new unified Gaussian dense-anchoring method. Unified dense anchors are learned for the limited partial multi-view data, thereby anchoring them into a unified dense representation space where the influence of data scarcity and missing views can be alleviated. We conduct extensive experiments to evaluate our method. The results on the Cub-googlenet-doc2vec, Handwritten, Caltech102, Scene15, Animal, ORL, tieredImagenet, and Birds-200-2011 datasets validate its effectiveness. The code will be released at https://github.com/zhouyuan888888/UGDA.

5.
IEEE Trans Neural Netw Learn Syst ; 34(3): 1367-1379, 2023 Mar.
Article in English | MEDLINE | ID: mdl-34464265

ABSTRACT

Spatiotemporal attention learning for video question answering (VideoQA) has always been a challenging task, and existing approaches treat the attention parts and the nonattention parts in isolation. In this work, we propose to enforce the correlation between the attention parts and the nonattention parts as a distance constraint for discriminative spatiotemporal attention learning. Specifically, we first introduce a novel attention-guided erasing mechanism into the traditional spatiotemporal attention to obtain multiple aggregated attention features and nonattention features, and then learn to separate the attention and nonattention features by an appropriate distance. The distance constraint is enforced by a metric learning loss, without increasing the inference complexity. In this way, the model can learn to produce a more discriminative spatiotemporal attention distribution over videos, thus enabling more accurate question answering. In order to incorporate the multiscale spatiotemporal information that is beneficial for video understanding, we additionally develop a pyramid variant on the basis of the proposed approach. Comprehensive ablation experiments are conducted to validate the effectiveness of our approach, and state-of-the-art performance is achieved on several widely used VideoQA datasets.
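A minimal sketch of a distance-based metric loss that separates paired attention and nonattention features is shown below; the hinge form and the margin value are illustrative assumptions, not necessarily the exact loss used in the paper.

```python
import torch
import torch.nn.functional as F

def attention_separation_loss(attn_feats, non_attn_feats, margin=1.0):
    """Hinge-style metric loss: push each aggregated attention feature at
    least `margin` away (in L2 distance) from its paired non-attention feature."""
    dist = F.pairwise_distance(attn_feats, non_attn_feats)   # (B,)
    return F.relu(margin - dist).mean()

# Toy usage with a batch of 8 aggregated feature pairs.
attn = torch.randn(8, 512)
non_attn = torch.randn(8, 512)
print(attention_separation_loss(attn, non_attn).item())
```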

6.
IEEE Trans Cybern ; 53(12): 7749-7759, 2023 Dec.
Article in English | MEDLINE | ID: mdl-36194716

ABSTRACT

Major depressive disorder (MDD) is one of the most common and severe mental illnesses, posing a huge burden on society and families. Recently, some multimodal methods have been proposed to learn a multimodal embedding for MDD detection and have achieved promising performance. However, these methods ignore the heterogeneity/homogeneity among the various modalities. Besides, earlier attempts ignore interclass separability and intraclass compactness. Inspired by the above observations, we propose a graph neural network (GNN)-based multimodal fusion strategy named modal-shared modal-specific GNN, which investigates the heterogeneity/homogeneity among various psychophysiological modalities and explores the potential relationships between subjects. Specifically, we develop a modal-shared and modal-specific GNN architecture to extract the inter-/intramodal characteristics. Furthermore, a reconstruction network is employed to ensure fidelity within each individual modality. Moreover, we impose an attention mechanism on the various embeddings to obtain a compact multimodal representation for the subsequent MDD detection task. We conduct extensive experiments on two public depression datasets, and the favorable results demonstrate the effectiveness of the proposed algorithm.


Subject(s)
Depressive Disorder, Major , Humans , Depression , Neural Networks, Computer , Algorithms , Learning
7.
Article in English | MEDLINE | ID: mdl-35759588

ABSTRACT

Recurrent neural networks (RNNs) have shown powerful performance in tackling various natural language processing (NLP) tasks, resulting in numerous powerful models containing both RNN neurons and feedforward neurons. On the other hand, the deep structure of RNNs has heavily restricted their deployment on mobile devices, where quite a few applications involve NLP tasks. Magnitude-based pruning (MP) is a promising way to address this challenge. However, existing MP methods are mostly designed for feedforward neural networks that do not involve a recurrent structure and thus perform less satisfactorily when pruning models containing RNN layers. In this article, a novel stage-wise MP method is proposed that explicitly takes the characteristic recurrent structure of RNNs into account and can effectively prune feedforward layers and RNN layers simultaneously. The connections of the network are first grouped into three types according to how they intersect with recurrent neurons. Then, an optimization-based pruning method is applied to compress each group of connections. Empirical studies show that the proposed method performs significantly better than commonly used RNN pruning methods; i.e., up to 96.84% of connections are pruned with little or even no degradation of precision indicators on the testing datasets.
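To make the grouping idea concrete, here is a rough PyTorch sketch that applies magnitude pruning with separate sparsity ratios to input-to-hidden, hidden-to-hidden, and feedforward weights; the grouping heuristic, the ratios, and the toy model are assumptions, and the paper's optimization-based per-group compression is not reproduced here.

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    """Small LSTM classifier used only to illustrate the pruning routine."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(64, 128, batch_first=True)
        self.fc = nn.Linear(128, 10)
    def forward(self, x):
        out, _ = self.rnn(x)
        return self.fc(out[:, -1])

def grouped_magnitude_prune(model, sparsity={"ih": 0.9, "hh": 0.8, "fc": 0.95}):
    """Zero the smallest-magnitude weights separately per connection group:
    input-to-hidden ('ih'), hidden-to-hidden ('hh'), and feedforward ('fc')."""
    for name, param in model.named_parameters():
        if "weight_ih" in name:
            ratio = sparsity["ih"]
        elif "weight_hh" in name:
            ratio = sparsity["hh"]
        elif name.endswith("weight") and param.dim() == 2:
            ratio = sparsity["fc"]
        else:
            continue                                    # leave biases untouched
        k = max(1, int(param.numel() * ratio))
        threshold = param.abs().flatten().kthvalue(k).values
        param.data.mul_((param.abs() > threshold).float())

model = TinyRNN()
grouped_magnitude_prune(model)
zeroed = sum((p == 0).sum().item() for p in model.parameters())
print("zeroed weights:", zeroed)
```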

8.
IEEE Trans Image Process ; 31: 3414-3429, 2022.
Article in English | MEDLINE | ID: mdl-35503833

ABSTRACT

Metric-based few-shot learning categorizes unseen query instances by measuring their distance to the categories appearing in the given support set. To facilitate distance measurement, prototypes are used to approximate the representations of categories. However, we find that prototypical representations are generally not discriminative enough to capture the discrepancy in the inter-categorical distribution of queries, thereby limiting classification accuracy. To overcome this issue, we propose a new Progressive Hierarchical-Refinement (PHR) method, which effectively refines the discrimination of prototypes by conducting a Progressive Discrimination Maximization strategy based on hierarchical feature representations. Specifically, we first encode supports and queries into representation spaces at the spatial level, global level, and semantic level. Then, refining coefficients are constructed by simultaneously exploring the metric information contained in these hierarchical embedding spaces. Under the guidance of the refining coefficients, the meta-refining loss progressively maximizes the discrimination degree of the inter-categorical prototypical representations. In addition, refining vectors are adopted to further enhance the representations of prototypes. In this way, the metric-based classification can be more accurate. Our PHR method shows competitive performance on the miniImagenet, CIFAR-FS, FC100, and CUB datasets. Moreover, PHR presents good compatibility: it can be incorporated into other few-shot learning models, making them more accurate.

9.
Article in English | MEDLINE | ID: mdl-35259119

ABSTRACT

Nowadays, vision-based computing tasks play an important role in various real-world applications. However, many vision computing tasks, e.g., semantic segmentation, are usually computationally expensive, posing a challenge to computing systems that are resource-constrained but require fast response speeds. Therefore, it is valuable to develop accurate and real-time vision processing models that only require limited computational resources. To this end, we propose the spatial-detail guided context propagation network (SGCPNet) for real-time semantic segmentation. In SGCPNet, we propose the strategy of spatial-detail guided context propagation. It uses the spatial details of shallow layers to guide the propagation of the low-resolution global contexts, in which the lost spatial information can be effectively reconstructed. In this way, the need to maintain high-resolution features along the network is removed, which largely improves model efficiency. On the other hand, due to the effective reconstruction of spatial details, segmentation accuracy can still be preserved. In the experiments, we validate the effectiveness and efficiency of the proposed SGCPNet model. On the Cityscapes dataset, for example, our SGCPNet achieves 69.5% mIoU segmentation accuracy, while its speed reaches 178.5 FPS on 768 x 1536 images on a GeForce GTX 1080 Ti GPU card. In addition, SGCPNet is very lightweight and contains only 0.61 M parameters. The code will be released at https://github.com/zhouyuan888888/SGCPNet.
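The PyTorch sketch below shows one plausible form of detail-guided context propagation, where a gate predicted from high-resolution shallow features modulates upsampled low-resolution context; the module name, channel sizes, and gating form are assumptions, not the released SGCPNet code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailGuidedUpsample(nn.Module):
    """Upsamples a low-resolution context map and modulates it with a gate
    predicted from high-resolution shallow features, so spatial detail
    guides where the global context is propagated."""
    def __init__(self, ctx_ch=128, detail_ch=32):
        super().__init__()
        self.align = nn.Conv2d(ctx_ch, detail_ch, kernel_size=1)
        self.gate = nn.Conv2d(detail_ch, detail_ch, kernel_size=3, padding=1)

    def forward(self, context, detail):
        # context: (B, ctx_ch, h, w) low-res; detail: (B, detail_ch, H, W) high-res
        up = F.interpolate(self.align(context), size=detail.shape[2:],
                           mode="bilinear", align_corners=False)
        g = torch.sigmoid(self.gate(detail))          # spatial-detail gate
        return g * up + detail                        # detail-guided fusion

# Toy usage: fuse 1/16-resolution context into 1/4-resolution shallow features.
m = DetailGuidedUpsample()
out = m(torch.randn(2, 128, 24, 48), torch.randn(2, 32, 96, 192))
print(out.shape)   # torch.Size([2, 32, 96, 192])
```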

10.
IEEE Trans Pattern Anal Mach Intell ; 44(2): 684-696, 2022 02.
Article in English | MEDLINE | ID: mdl-30990419

ABSTRACT

Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it requires comprehending fine-grained language compositions. However, existing solutions merely rely on the association between holistic language features and visual features, while neglecting the compositional reasoning implied in the language. In this paper, we propose a natural language grounding model that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion. We call our model RvG-Tree: Recursive Grounding Tree, which is inspired by the intuition that any language expression can be recursively decomposed into two constituent parts, and the grounding confidence score can be recursively accumulated by calculating the grounding scores returned by the two sub-trees. RvG-Tree can be trained end-to-end by using the Straight-Through Gumbel-Softmax estimator, which allows the gradients from the continuous score functions to pass through the discrete tree construction. Experiments on several benchmarks show that our model achieves state-of-the-art performance with more explainable reasoning.
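The Straight-Through Gumbel-Softmax estimator mentioned above is a standard PyTorch primitive; the toy snippet below only illustrates how a discrete choice (here, between two hypothetical merge options) stays differentiable, and does not reproduce the tree construction itself.

```python
import torch
import torch.nn.functional as F

# Two candidate ways of merging a pair of adjacent nodes; the discrete choice
# is sampled with Gumbel-Softmax, and hard=True gives a one-hot forward pass
# while gradients flow through the soft probabilities (straight-through).
merge_logits = torch.randn(2, requires_grad=True)              # scores for 2 merge options
choice = F.gumbel_softmax(merge_logits, tau=1.0, hard=True)    # one-hot, differentiable

candidate_scores = torch.tensor([0.3, 0.8])                    # grounding scores of the options
selected_score = (choice * candidate_scores).sum()             # discrete pick, soft gradient
selected_score.backward()
print(choice, merge_logits.grad)
```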


Subject(s)
Artificial Intelligence , Language , Algorithms , Humans , Learning
11.
IEEE Trans Image Process ; 30: 8410-8425, 2021.
Article in English | MEDLINE | ID: mdl-34596539

ABSTRACT

This paper strives to predict fine-grained fashion similarity. In this similarity paradigm, one should pay more attention to the similarity between fashion items in terms of a specific design/attribute, for example, whether the collar designs of two garments are similar. This has potential value in many fashion-related applications, such as fashion copyright protection. To this end, we propose an Attribute-Specific Embedding Network (ASEN) to jointly learn multiple attribute-specific embeddings, and thus measure fine-grained similarity in the corresponding space. The proposed ASEN is comprised of a global branch and a local branch. The global branch takes the whole image as input to extract features from a global perspective, while the local branch takes as input the zoomed-in region of interest (RoI) w.r.t. the specified attribute and is thus able to extract more fine-grained features. As the global branch and the local branch extract features from different perspectives, they are complementary to each other. Additionally, in each branch, two attention modules, i.e., Attribute-aware Spatial Attention and Attribute-aware Channel Attention, are integrated to enable ASEN to locate the related regions and capture the essential patterns under the guidance of the specified attribute, thus making the learned attribute-specific embeddings better reflect the fine-grained similarity. Extensive experiments on three fashion-related datasets, i.e., FashionAI, DARN, and DeepFashion, show the effectiveness of ASEN for fine-grained fashion similarity prediction and its potential for fashion reranking. Code and data are available at https://github.com/maryeon/asenpp.
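A rough PyTorch sketch of attribute-aware spatial attention is shown below: feature-map locations are weighted by their affinity with an attribute embedding before pooling; the module name, projection, and dimensions are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class AttributeSpatialAttention(nn.Module):
    """Weights conv feature-map locations by their affinity with an
    attribute embedding, then pools to an attribute-specific vector."""
    def __init__(self, feat_dim=512, attr_dim=64):
        super().__init__()
        self.attr_proj = nn.Linear(attr_dim, feat_dim)

    def forward(self, feat_map, attr_emb):
        # feat_map: (B, C, H, W), attr_emb: (B, attr_dim)
        B, C, H, W = feat_map.shape
        query = self.attr_proj(attr_emb)                          # (B, C)
        scores = torch.einsum("bchw,bc->bhw", feat_map, query)    # attribute affinity
        attn = torch.softmax(scores.view(B, -1), dim=1).view(B, 1, H, W)
        return (feat_map * attn).sum(dim=(2, 3))                  # (B, C)

# Toy usage: embed a batch of 4 images under a hypothetical "collar" attribute.
module = AttributeSpatialAttention()
out = module(torch.randn(4, 512, 14, 14), torch.randn(4, 64))
print(out.shape)   # torch.Size([4, 512])
```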

12.
Article in English | MEDLINE | ID: mdl-34252033

ABSTRACT

Deep multiview clustering methods have achieved remarkable performance. However, all of them fail to consider the difficulty labels (the uncertainty of the ground truth for training samples) of multiview samples, which may yield a nonideal clustering network that gets stuck in poor local optima during training; worse still, the difficulty labels of multiview samples are often inconsistent, which makes the problem even more challenging to handle. In this article, we propose a novel deep adversarial inconsistent cognitive sampling (DAICS) method for multiview progressive subspace clustering. A multiview binary classification (easy or difficult) loss and a feature similarity loss are proposed to jointly learn a binary classifier and a deep consistent feature embedding network through an adversarial minimax game over the difficulty labels of multiview consistent samples. We develop a multiview cognitive sampling strategy to select input samples from easy to difficult for multiview clustering network training. However, the distributions of easy and difficult samples are mixed together, so achieving this goal is not trivial. To resolve this, we define a sampling probability with a theoretical guarantee. Based on that, a golden section mechanism is further designed to generate a sample set boundary that progressively selects samples with varied difficulty labels via a gate unit, which is utilized to jointly learn a multiview common progressive subspace and clustering network for more efficient clustering. Experimental results on four real-world datasets demonstrate the superiority of DAICS over state-of-the-art methods.

13.
IEEE Trans Image Process ; 30: 4894-4904, 2021.
Article in English | MEDLINE | ID: mdl-33945476

ABSTRACT

Deep convolutional neural networks have greatly benefited computer vision tasks. However, their high computational complexity limits their real-world applications. To this end, many methods have been proposed for efficient network learning and for applications on portable mobile devices. In this paper, we propose a novel Moving-Mobile-Network, named M2Net, for landmark recognition, in which each landmark image is equipped with its geographic location. We find that M2Net can essentially promote the diversity of inference path (selected block subset) selection, so as to enhance recognition accuracy. This intuition is realized by our proposed reward function, which takes the geo-location and landmarks as input. We also find that the performance of other portable networks can be improved via our architecture. We construct two landmark image datasets, with each landmark associated with geographic information, over which we conduct extensive experiments to demonstrate that M2Net achieves improved recognition accuracy with comparable complexity.

14.
IEEE Trans Image Process ; 30: 2758-2770, 2021.
Article in English | MEDLINE | ID: mdl-33476268

ABSTRACT

Video question answering is an important task combining Natural Language Processing and Computer Vision, which requires a machine to obtain a thorough understanding of the video. Most existing approaches simply capture spatio-temporal information in videos by using a combination of recurrent and convolutional neural networks. Nonetheless, most previous works focus only on salient frames or regions, which often miss significant details, such as potential location and action relations. In this paper, we propose a new method called Graph-based Multi-interaction Network for video question answering. In our model, a new attention mechanism named multi-interaction is designed to capture both element-wise and segment-wise sequence interactions simultaneously, which can be found between and inside the multi-modal inputs. Moreover, we propose a graph-based relation-aware neural network to explore a more fine-grained visual representation, which can capture the relationships and dependencies between objects spatially and temporally. We evaluate our method on TGIF-QA and two other video QA datasets. The qualitative and quantitative experimental results show the effectiveness of our model, which achieves state-of-the-art performance.

15.
IEEE Trans Neural Netw Learn Syst ; 32(3): 1351-1364, 2021 Mar.
Article in English | MEDLINE | ID: mdl-32310794

ABSTRACT

Multioutput regression, which refers to simultaneously predicting multiple continuous output variables with a single model, has drawn increasing attention in the machine learning community due to its strong ability to capture the correlations among output variables. The methodology of output-space embedding, built upon the low-rank assumption, is now the mainstream for multioutput regression since it can effectively reduce the number of parameters while achieving effective performance. The existing low-rank methods, however, are sensitive to the noise in both inputs and outputs, which we refer to as the noise problem. In this article, we develop a novel multioutput regression method that simultaneously alleviates input and output noise, namely, robust multioutput regression by alleviating input and output noises (RMoR-Aion), where both the input and output noise are handled by leveraging auxiliary matrices. Furthermore, we propose a prediction output manifold constraint with correlation information regarding the output variables to further reduce the adverse effects of noise. Our empirical studies demonstrate the effectiveness of RMoR-Aion compared with state-of-the-art baseline methods, and RMoR-Aion is more stable in settings with artificial noise.

16.
IEEE Trans Neural Netw Learn Syst ; 31(8): 2791-2804, 2020 Aug.
Article in English | MEDLINE | ID: mdl-30676983

ABSTRACT

Matrix factorization (MF) has been widely used to discover the low-rank structure and to predict the missing entries of a data matrix. In many real-world learning systems, the data matrix can be very high dimensional but sparse. This poses an imbalanced learning problem, since the number of missing entries is usually much larger than that of the observed entries, yet they cannot be ignored due to the valuable negative signal they carry. For efficiency, existing work typically applies a uniform weight on missing entries to allow a fast learning algorithm. However, this simplification decreases modeling fidelity, resulting in suboptimal performance for downstream applications. In this paper, we weight the missing data nonuniformly and, more generically, we allow any weighting strategy on the missing data. To address the efficiency challenge, we propose a fast learning method whose time complexity is determined by the number of observed entries in the data matrix rather than the matrix size. The key idea is twofold: 1) we apply truncated singular value decomposition to the weight matrix to obtain a more compact representation of the weights, and 2) we learn the MF parameters with elementwise alternating least squares (eALS) and memorize the key intermediate variables to avoid unnecessary repeated computations. We conduct extensive experiments on two recommendation benchmarks, demonstrating the correctness, efficiency, and effectiveness of our fast eALS method.
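A small NumPy sketch of idea 1) follows: a truncated SVD compresses a hypothetical nonuniform weight matrix into low-rank factors that subsequent eALS-style updates could cache; the rank and the random matrix are arbitrary illustrations, not the paper's setup.

```python
import numpy as np

# The nonuniform weight matrix W (users x items) on missing entries can be
# approximated by a rank-r truncated SVD, W ~= P @ Q.T, so quadratic terms
# over all missing entries can be cached instead of enumerated one by one.
rng = np.random.default_rng(0)
W = rng.random((1000, 500))                      # hypothetical weights on missing data
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 8                                            # truncation rank (illustrative)
P = U[:, :r] * np.sqrt(s[:r])                    # (1000, r)
Q = Vt[:r, :].T * np.sqrt(s[:r])                 # (500, r)
print("relative approximation error:",
      np.linalg.norm(W - P @ Q.T) / np.linalg.norm(W))
```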

17.
Article in English | MEDLINE | ID: mdl-30629501

ABSTRACT

Learning discriminative representations for unseen person images is critical for person re-identification (ReID). Most current approaches learn deep representations through classification tasks, which essentially minimize the empirical classification risk on the training set. As shown in our experiments, such representations easily overfit to a discriminative human body part in the training set. To gain discriminative power on unseen person images, we propose a deep representation learning procedure named Part Loss Network (PL-Net), which minimizes both the empirical classification risk on training person images and the representation learning risk on unseen person images. The representation learning risk is evaluated by the proposed part loss, which automatically detects human body parts and computes the person classification loss on each part separately. Compared with the traditional global classification loss, simultaneously considering the part loss forces the deep network to learn representations for different body parts and to gain discriminative power on unseen persons. Experimental results on three person ReID datasets, i.e., Market1501, CUHK03, and VIPeR, show that our representation outperforms existing deep representations.
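As an illustration only, the sketch below sums a global identity-classification loss with per-part losses computed on fixed horizontal stripes; the original part loss detects body parts automatically, so the fixed-stripe split, names, and dimensions here are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartAndGlobalLoss(nn.Module):
    """Illustrative variant: splits the conv feature map into K horizontal
    stripes, applies a separate identity classifier to each stripe, and sums
    the per-part losses with the global classification loss."""
    def __init__(self, feat_dim=2048, num_ids=751, num_parts=4):
        super().__init__()
        self.global_fc = nn.Linear(feat_dim, num_ids)
        self.part_fcs = nn.ModuleList(
            [nn.Linear(feat_dim, num_ids) for _ in range(num_parts)])
        self.num_parts = num_parts

    def forward(self, feat_map, labels):
        # feat_map: (B, C, H, W), labels: (B,)
        global_feat = feat_map.mean(dim=(2, 3))
        loss = F.cross_entropy(self.global_fc(global_feat), labels)
        stripes = feat_map.chunk(self.num_parts, dim=2)      # split along height
        for fc, stripe in zip(self.part_fcs, stripes):
            loss = loss + F.cross_entropy(fc(stripe.mean(dim=(2, 3))), labels)
        return loss

criterion = PartAndGlobalLoss()
loss = criterion(torch.randn(8, 2048, 24, 8), torch.randint(0, 751, (8,)))
print(loss.item())
```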

18.
Article in English | MEDLINE | ID: mdl-30106730

ABSTRACT

Correlation filters (CFs) have been applied to visual tracking with success, providing excellent performance in terms of accuracy and efficiency. The underlying periodic assumption on the training samples makes them highly efficient when using the fast Fourier transform (FFT), yet it also brings unwanted boundary effects. To address this issue, the recently proposed spatially regularized discriminative CF (SRDCF) method introduces a Gaussian weight function to regularize the learned filter, yielding favorable accuracy but high computational complexity, because the SRDCF objective cannot be solved in closed form via the FFT. Motivated by SRDCF, we present an efficient and effective CF-based tracker using center-biased constraint weights (CBCWs), which simultaneously improve speed and accuracy. Specifically, we first construct a CBCW function by exploiting the symmetry of the Fourier transform. The values of the constraint weights are real in both the time and frequency domains, so that the optimization can be solved directly in the frequency domain without any data transformation, thereby greatly reducing the computational complexity. Moreover, according to the average peak-to-correlation energy value of the CF response, we propose an efficient and effective filter update strategy to handle occlusions during tracking. Extensive experiments on the OTB-2013, OTB-2015, and VOT2016 benchmarks demonstrate that the proposed tracker significantly outperforms the baseline SRDCF in terms of accuracy and efficiency. Moreover, the proposed method performs favorably against 16 other representative state-of-the-art methods regarding robustness and success rate.
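The average peak-to-correlation energy criterion behind the update decision can be sketched as below; the formula follows the common APCE definition, which we assume matches the paper's usage, and the actual thresholding strategy of the tracker is not shown.

```python
import numpy as np

def average_peak_to_correlation_energy(response):
    """APCE of a correlation response map: a high value indicates a sharp,
    confident peak; a low value (e.g., under occlusion) suggests skipping
    or down-weighting the filter update."""
    f_max, f_min = response.max(), response.min()
    return (f_max - f_min) ** 2 / np.mean((response - f_min) ** 2)

# Toy usage: a clean Gaussian-like peak vs. a flat, noisy response.
y, x = np.mgrid[-30:31, -30:31]
sharp = np.exp(-(x**2 + y**2) / 20.0)
noisy = np.random.rand(61, 61) * 0.3
print(average_peak_to_correlation_energy(sharp),
      average_peak_to_correlation_energy(noisy))
```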

19.
IEEE Trans Image Process ; 27(10): 4933-4944, 2018 Oct.
Article in English | MEDLINE | ID: mdl-29985134

ABSTRACT

As characterizing videos simultaneously from spatial and temporal cues has been shown to be crucial for video analysis, the combination of convolutional neural networks and recurrent neural networks, i.e., recurrent convolution networks (RCNs), is a natural framework for learning spatio-temporal video features. In this paper, we develop a novel sequential vector of locally aggregated descriptors (VLAD) layer, named SeqVLAD, to combine a trainable VLAD encoding process and the RCN architecture into a single framework. In particular, sequential convolutional feature maps extracted from successive video frames are fed into the RCNs to learn soft spatio-temporal assignment parameters, so as to aggregate not only the detailed spatial information in separate video frames but also the fine motion information across successive video frames. Moreover, we improve the gated recurrent unit (GRU) of RCNs by sharing the input-to-hidden parameters and propose an improved GRU-RCN architecture named shared GRU-RCN (SGRU-RCN). Thus, our SGRU-RCN has fewer parameters and a lower risk of overfitting. In experiments, we evaluate SeqVLAD on the tasks of video captioning and video action recognition. Experimental results on the Microsoft Research Video Description Corpus, the Montreal Video Annotation Dataset, UCF101, and HMDB51 demonstrate the effectiveness and good performance of our method.
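A simplified PyTorch sketch of soft-assignment VLAD aggregation is given below; for brevity the recurrent estimation of assignment parameters, which is the core of SeqVLAD, is replaced with a plain linear assignment, so this is an assumption-laden illustration rather than the proposed layer.

```python
import torch
import torch.nn as nn

class SoftVLAD(nn.Module):
    """Soft-assignment VLAD over per-frame conv features: each local descriptor
    is softly assigned to K cluster centers and its residuals are accumulated."""
    def __init__(self, dim=512, num_clusters=32):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim) * 0.1)
        self.assign = nn.Linear(dim, num_clusters)   # stand-in for the recurrent assignment

    def forward(self, descriptors):
        # descriptors: (N, dim) local features pooled from all frames
        a = torch.softmax(self.assign(descriptors), dim=1)        # (N, K)
        residuals = descriptors.unsqueeze(1) - self.centers       # (N, K, dim)
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=0)           # (K, dim)
        return nn.functional.normalize(vlad.flatten(), dim=0)     # (K*dim,)

# Toy usage: aggregate 14x14 feature maps from 8 frames into one video vector.
feats = torch.randn(8 * 14 * 14, 512)
print(SoftVLAD()(feats).shape)   # torch.Size([16384])
```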

20.
IEEE Trans Image Process ; 27(7): 3210-3221, 2018 Jul.
Article in English | MEDLINE | ID: mdl-29641401

ABSTRACT

Existing video hash functions are built on three isolated stages: frame pooling, relaxed learning, and binarization, which have not adequately explored the temporal order of video frames in a joint binary optimization model, resulting in severe information loss. In this paper, we propose a novel unsupervised video hashing framework dubbed self-supervised video hashing (SSVH), which is able to capture the temporal nature of videos in an end-to-end learning-to-hash fashion. We specifically address two central problems: 1) how to design an encoder-decoder architecture to generate binary codes for videos and 2) how to equip the binary codes with the ability to support accurate video retrieval. We design a hierarchical binary auto-encoder to model the temporal dependencies in videos at multiple granularities, and embed the videos into binary codes with fewer computations than a stacked architecture. Then, we encourage the binary codes to simultaneously reconstruct the visual content and the neighborhood structure of the videos. Experiments on two real-world datasets show that our SSVH method can significantly outperform the state-of-the-art methods and achieve the best current performance on the task of unsupervised video retrieval.
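The binarization step of such an encoder-decoder can be sketched with a sign function plus a straight-through estimator, as below; this is a common choice assumed here for illustration, and it omits the hierarchical auto-encoder and the reconstruction/neighborhood objectives.

```python
import torch

class STEBinarize(torch.autograd.Function):
    """Sign binarization with a straight-through estimator: the forward pass
    emits +/-1 codes, the backward pass copies gradients for inputs in [-1, 1]."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()

# Toy usage: binarize a video embedding and backpropagate through the codes.
emb = torch.randn(4, 64, requires_grad=True)
codes = STEBinarize.apply(emb)                     # +/-1 binary codes, differentiable
loss = (codes - torch.randn(4, 64)).pow(2).mean()  # placeholder objective
loss.backward()
print(codes.unique(), emb.grad.shape)
```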
