Results 1 - 20 of 35
1.
IEEE Trans Image Process ; 33: 5771-5782, 2024.
Article in English | MEDLINE | ID: mdl-39361466

ABSTRACT

Existing methods of multiple human parsing (MHP) apply deep models to learn instance-level representations for segmenting each person into non-overlapping body parts. However, the learned representations often contain many spurious correlations that degrade model generalization, leaving learned models vulnerable to visually contextual variations in images (e.g., unseen image styles/external interventions). To tackle this, we present a causal property integrated parsing model termed CPI-Parser, which is driven by fundamental causal principles involving two causal properties for human parsing (i.e., causal diversity and causal invariance). Specifically, we assume that an image is constructed by a mix of causal factors (the characteristics of body parts) and non-causal factors (external contexts), where only the former decide the essence of human parsing. Since causal/non-causal factors are unobservable, the proposed CPI-Parser is required to separate, from an image, the key factors that satisfy the causal properties. In this way, the parser is able to rely on causal factors w.r.t. relevant evidence rather than non-causal factors w.r.t. spurious correlations, thus alleviating model degradation and yielding improved parsing ability. Notably, the CPI-Parser is designed in a flexible way and can be integrated into any existing MHP framework. Extensive experiments conducted on three widely used benchmarks demonstrate the effectiveness and generalizability of our method. Code and models are released (https://github.com/HAG-uestc/CPI-Parser) for research purposes.


Subjects
Algorithms; Image Processing, Computer-Assisted; Humans; Image Processing, Computer-Assisted/methods; Deep Learning
2.
Article in English | MEDLINE | ID: mdl-38194384

ABSTRACT

Unsupervised anomaly detection (UAD) attracts substantial research interest and drives widespread applications, where only anomaly-free samples are available for training. Some UAD applications further intend to locate the anomalous regions, even without any anomaly information. Although the absence of anomalous samples and annotations deteriorates UAD performance, an inconspicuous yet powerful statistical model, the normalizing flow, is well suited to anomaly detection (AD) and localization in an unsupervised fashion. Flow-based probabilistic models, trained only on anomaly-free data, can efficiently distinguish unpredictable anomalies by assigning them much lower likelihoods than normal data. Nevertheless, the size variation of unpredictable anomalies introduces another difficulty for flow-based methods in high-precision AD and localization. To generalize across anomaly size variation, we propose a novel multiscale flow-based framework (MSFlow) composed of asymmetrical parallel flows followed by a fusion flow to exchange multiscale perceptions. Moreover, different multiscale aggregation strategies are adopted for image-wise AD and pixel-wise anomaly localization according to the discrepancy between them. The proposed MSFlow is evaluated on three AD datasets, significantly outperforming existing methods. Notably, on the challenging MVTec AD benchmark, our MSFlow achieves a new state-of-the-art (SOTA) with a detection AUROC score of up to 99.7%, a localization AUROC score of 98.8%, and a PRO score of 97.1%.
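To make the flow-based scoring idea concrete, below is a minimal PyTorch sketch of the general mechanism the abstract relies on: a normalizing flow trained only on anomaly-free features assigns low likelihood to anomalies. The single affine-coupling layer, feature dimensions, and training loop are illustrative assumptions, not the authors' MSFlow architecture.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One affine-coupling layer: transforms half of the vector conditioned on the other half."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                              # bound the scales for stability
        z2 = x2 * torch.exp(s) + t
        return torch.cat([x1, z2], dim=1), s.sum(dim=1)   # transformed vector, log|det J|

def log_likelihood(flow, x):
    z, log_det = flow(x)
    # change-of-variables: log p(x) = log N(z; 0, I) + log|det J|
    log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * z.size(1) * torch.log(torch.tensor(2 * torch.pi))
    return log_pz + log_det

# Train on anomaly-free features only, then score new samples: lower likelihood => anomaly.
flow = AffineCoupling(dim=64)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
normal_feats = torch.randn(512, 64)                    # stand-in for anomaly-free features
for _ in range(100):
    opt.zero_grad()
    (-log_likelihood(flow, normal_feats).mean()).backward()
    opt.step()
anomaly_score = -log_likelihood(flow, torch.randn(8, 64))   # higher = more anomalous
```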

3.
IEEE Trans Image Process ; 33: 1768-1781, 2024.
Article in English | MEDLINE | ID: mdl-38442063

ABSTRACT

In real-world datasets, visually related images often form clusters, and these clusters can be further grouped into larger categories with more general semantics. These inherent hierarchical structures can help capture the underlying distribution of data, making it easier to learn robust hash codes that lead to better retrieval performance. However, existing methods fail to make use of this hierarchical information, which in turn prevents the accurate preservation of relationships between data points in the learned hash codes, resulting in suboptimal performance. In this paper, we focus on applying visual hierarchical information to self-supervised hash learning and address three key challenges: the construction, embedding, and exploitation of visual hierarchies. We propose a new self-supervised hashing method named Hierarchical Hyperbolic Contrastive Hashing (HHCH), which makes advances in three aspects. First, we propose to embed continuous hash codes into hyperbolic space for accurate semantic expression, since embedding hierarchies in hyperbolic space produces less distortion than in the hypersphere or Euclidean space. Second, we adapt the K-Means algorithm to operate in hyperbolic space. The proposed hierarchical hyperbolic K-Means algorithm achieves the adaptive construction of hierarchical semantic structures. Last but not least, to exploit the hierarchical semantic structures in hyperbolic space, we propose a hierarchical contrastive learning algorithm, including hierarchical instance-wise and hierarchical prototype-wise contrastive learning. Extensive experiments on four benchmark datasets demonstrate that the proposed method outperforms state-of-the-art self-supervised hashing methods. Our code is released at https://github.com/HUST-IDSM-AI/HHCH.git.
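As a rough illustration of what "K-Means in hyperbolic space" means, the sketch below swaps the Euclidean distance in the assignment step for the Poincaré-ball geodesic distance. The distance formula is standard; the point generation and center choice are illustrative assumptions, not the paper's hierarchical hyperbolic K-Means.

```python
import numpy as np

def poincare_dist(u, v, eps=1e-9):
    """Geodesic distance between points strictly inside the unit Poincare ball."""
    uu = np.sum(u * u, axis=-1)
    vv = np.sum(v * v, axis=-1)
    diff = np.sum((u - v) ** 2, axis=-1)
    x = 1.0 + 2.0 * diff / ((1.0 - uu) * (1.0 - vv) + eps)
    return np.arccosh(np.maximum(x, 1.0))

def assign_clusters(points, centers):
    """Assignment step of K-Means under the hyperbolic metric."""
    d = np.stack([poincare_dist(points, c[None, :]) for c in centers], axis=1)
    return d.argmin(axis=1)   # index of the nearest center for each point

# Toy usage: map random points into the ball, then use three of them as initial centers.
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 2))
pts = 0.8 * pts / (1.0 + np.linalg.norm(pts, axis=1, keepdims=True))  # now ||p|| < 0.8
labels = assign_clusters(pts, centers=pts[:3])
```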

4.
IEEE Trans Image Process ; 33: 2558-2571, 2024.
Article in English | MEDLINE | ID: mdl-38530729

ABSTRACT

Despite remarkable successes in unimodal learning tasks, backdoor attacks against cross-modal learning remain underexplored due to limited generalization and inferior stealthiness when multiple modalities are involved. Notably, since works in this area mainly inherit ideas from unimodal visual attacks, they struggle to handle diverse cross-modal attack circumstances and to manipulate imperceptible trigger samples, which hinders their practicability in real-world applications. In this paper, we introduce a novel bilateral backdoor to fill in the missing pieces of the puzzle in cross-modal backdoors and propose a generalized invisible backdoor framework against cross-modal learning (BadCM). Specifically, a cross-modal mining scheme is developed to capture the modality-invariant components as target poisoning areas, where well-designed trigger patterns injected into these regions can be efficiently recognized by the victim models. This strategy is adapted to different image-text cross-modal models, making our framework applicable to various attack scenarios. Furthermore, to generate poisoned samples of high stealthiness, we conceive modality-specific generators for the visual and linguistic modalities that facilitate hiding explicit trigger patterns in modality-invariant regions. To the best of our knowledge, BadCM is the first invisible backdoor method deliberately designed for diverse cross-modal attacks within one unified framework. Comprehensive experimental evaluations on two typical applications, i.e., cross-modal retrieval and VQA, demonstrate the effectiveness and generalization of our method under multiple kinds of attack scenarios. Moreover, we show that BadCM can robustly evade existing backdoor defenses. Our code is available at https://github.com/xandery-geek/BadCM.

5.
IEEE Trans Image Process ; 32: 43-56, 2023.
Article in English | MEDLINE | ID: mdl-36459603

ABSTRACT

How to avoid biased predictions is an important and active research question in scene graph generation (SGG). Current state-of-the-art methods employ debiasing techniques such as resampling and causality analysis. However, the role of intrinsic cues in the features that cause biased training has remained under-explored. In this paper, for the first time, we make the surprising observation that object identity information, in the form of object label embeddings (e.g., GLOVE), is principally responsible for biased predictions. We empirically observe that, even without any visual features, a number of recent SGG models can produce comparable or even better results solely from object label embeddings. Motivated by this insight, we propose to leverage a conditional variational auto-encoder to decouple the entangled visual features into two meaningful components: the object's intrinsic identity features and the extrinsic, relation-dependent state features. We further develop two compositional learning strategies, on the relation and object levels, to mitigate the data scarcity issue of rare relations. On the two benchmark datasets Visual Genome and GQA, we conduct extensive experiments on three scenarios, i.e., conventional, few-shot, and zero-shot SGG. Results consistently demonstrate that our proposed Decomposition and Composition (DeC) method effectively alleviates the biases in relation prediction. Moreover, DeC is model-free and significantly improves the performance of recent SGG models, establishing new state-of-the-art performance.
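The "label embeddings alone" observation is easy to picture: the sketch below predicts a predicate from only the subject and object label embeddings, with no visual features at all. The embedding table and classifier sizes are illustrative stand-ins (e.g., for GLOVE vectors and an SGG relation head), not the paper's actual model.

```python
import torch
import torch.nn as nn

num_classes, num_predicates, emb_dim = 150, 50, 200    # Visual Genome-like sizes (illustrative)
label_emb = nn.Embedding(num_classes, emb_dim)          # stand-in for pretrained GLOVE vectors
predicate_head = nn.Sequential(
    nn.Linear(2 * emb_dim, 256), nn.ReLU(),
    nn.Linear(256, num_predicates),
)

subj = torch.randint(0, num_classes, (32,))             # subject label ids in a batch
obj = torch.randint(0, num_classes, (32,))              # object label ids in a batch
pair = torch.cat([label_emb(subj), label_emb(obj)], dim=1)
predicate_logits = predicate_head(pair)                 # prediction uses no visual features
```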

6.
IEEE Trans Image Process ; 32: 6274-6288, 2023.
Article in English | MEDLINE | ID: mdl-37948145

ABSTRACT

Scene graph generation (SGG) and human-object interaction (HOI) detection are two important visual tasks that aim to localize and recognize relationships between objects, and interactions between humans and objects, respectively. Prevailing works treat them as distinct tasks, leading to task-specific models tailored to individual datasets. However, we posit that the presence of visual relationships can furnish crucial contextual and intricate relational cues that significantly augment the inference of human-object interactions. This motivates us to ask whether there is a natural intrinsic relationship between the two tasks, where scene graphs can serve as a source for inferring human-object interactions. In light of this, we introduce SG2HOI+, a unified one-step model based on the Transformer architecture. Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection. Concretely, we first use a relation Transformer to generate relation triples from a suite of visual features. Subsequently, another Transformer-based decoder predicts human-object interactions based on the generated relation triples. A comprehensive series of experiments conducted on established benchmark datasets, including Visual Genome, V-COCO, and HICO-DET, demonstrates the compelling performance of our SG2HOI+ model in comparison to prevalent one-stage SGG models. Remarkably, our approach achieves competitive performance when compared to state-of-the-art HOI methods. Additionally, we observe that our SG2HOI+, jointly trained on both SGG and HOI tasks in an end-to-end manner, yields substantial improvements for both tasks compared to individualized training paradigms.


Subjects
Recognition, Psychology; Visual Perception; Humans
7.
IEEE Trans Neural Netw Learn Syst ; 34(8): 4791-4802, 2023 Aug.
Article in English | MEDLINE | ID: mdl-34878979

ABSTRACT

Learning accurate low-dimensional embeddings for a network is a crucial task, as it facilitates many downstream network analytics tasks. For large networks, the trained embeddings often require a significant amount of space to store, making storage and processing a challenge. Building on our previous work on semisupervised network embedding, we develop d-SNEQ, a differentiable DNN-based quantization method for network embedding. d-SNEQ incorporates a rank loss to equip the learned quantization codes with rich high-order information and is able to substantially compress the size of trained embeddings, thus reducing the storage footprint and accelerating retrieval speed. We also propose a new evaluation metric, path prediction, to more fairly and directly evaluate model performance on the preservation of high-order information. Our evaluation on four real-world networks of diverse characteristics shows that d-SNEQ outperforms a number of state-of-the-art embedding methods in link prediction, path prediction, node classification, and node recommendation, while being far more space- and time-efficient.

8.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 13921-13940, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37788219

ABSTRACT

The performance of current Scene Graph Generation (SGG) models is severely hampered by hard-to-distinguish predicates, e.g., "woman-on/standing on/walking on-beach". As general SGG models tend to predict head predicates and re-balancing strategies prefer tail categories, neither can appropriately handle hard-to-distinguish predicates. To tackle this issue, inspired by fine-grained image classification, which focuses on differentiating hard-to-distinguish objects, we propose Adaptive Fine-Grained Predicates Learning (FGPL-A), which aims at differentiating hard-to-distinguish predicates for SGG. First, we introduce an Adaptive Predicate Lattice (PL-A) to identify hard-to-distinguish predicates, which adaptively explores predicate correlations in keeping with the model's dynamic learning pace. In practice, PL-A is initialized from the SGG dataset and refined by exploring the model's predictions on the current mini-batch. Utilizing PL-A, we propose an Adaptive Category Discriminating Loss (CDL-A) and an Adaptive Entity Discriminating Loss (EDL-A), which progressively regularize the model's discriminating process with fine-grained supervision concerning its dynamic learning status, ensuring a balanced and efficient learning process. Extensive experimental results show that our proposed model-agnostic strategy significantly boosts the performance of benchmark models on the VG-SGG and GQA-SGG datasets by up to 175% and 76% in Mean Recall@100, achieving new state-of-the-art performance. Moreover, experiments on Sentence-to-Graph Retrieval and Image Captioning tasks further demonstrate the practicability of our method.

9.
Article in English | MEDLINE | ID: mdl-37922164

ABSTRACT

Out-of-distribution (OOD) detection aims to detect "unknown" data whose labels have not been seen during the in-distribution (ID) training process. Recent progress in representation learning gives rise to distance-based OOD detection, which recognizes inputs as ID/OOD according to their relative distances to the training data of ID classes. Previous approaches calculate pairwise distances relying only on global image representations, which can be sub-optimal, as the inevitable background clutter and intra-class variation may drive image-level representations from the same ID class far apart in a given representation space. In this work, we overcome this challenge by proposing Multi-scale OOD DEtection (MODE), the first framework leveraging both global visual information and local region details of images to maximally benefit OOD detection. Specifically, we first find that existing models pretrained with off-the-shelf cross-entropy or contrastive losses fail to capture valuable local representations for MODE, due to the scale discrepancy between the ID training and OOD detection processes. To mitigate this issue and encourage locally discriminative representations in ID training, we propose Attention-based Local PropAgation (ALPA), a trainable objective that exploits a cross-attention mechanism to align and highlight the local regions of the target objects for pairwise examples. During test-time OOD detection, a Cross-Scale Decision (CSD) function is further devised on the most discriminative multi-scale representations to distinguish ID/OOD data more faithfully. We demonstrate the effectiveness and flexibility of MODE on several benchmarks: on average, MODE outperforms the previous state-of-the-art by up to 19.24% in FPR and 2.77% in AUROC. Code is available at https://github.com/JimZAI/MODE-OOD.
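As a rough sketch of the distance-based OOD idea the abstract builds on (not the authors' MODE/ALPA/CSD method), the snippet below flags an input as OOD when its mean distance to its nearest ID training features is large. Feature dimensions and the toy data are illustrative assumptions.

```python
import numpy as np

def knn_ood_score(test_feats, train_feats, k=5):
    """Mean distance to the k nearest ID training features; larger => more likely OOD."""
    # pairwise Euclidean distances, shape (n_test, n_train)
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, :k]
    return knn.mean(axis=1)

rng = np.random.default_rng(0)
id_train = rng.normal(0.0, 1.0, size=(1000, 32))    # stand-in ID training features
id_test = rng.normal(0.0, 1.0, size=(10, 32))
ood_test = rng.normal(4.0, 1.0, size=(10, 32))      # shifted distribution, acts as OOD
print(knn_ood_score(id_test, id_train).mean(), knn_ood_score(ood_test, id_train).mean())
```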

10.
IEEE Trans Image Process ; 32: 2399-2412, 2023.
Article in English | MEDLINE | ID: mdl-37015122

ABSTRACT

Multi-Codebook Quantization (MCQ) is a generalized version of existing codebook-based quantizations for Approximate Nearest Neighbor (ANN) search. Specifically, MCQ picks one codeword from each sub-codebook independently and takes the sum of the picked codewords to approximate the original vector. The objective function involves no constraints; therefore, MCQ theoretically has the potential to achieve the best performance, because the solutions of other codebook-based quantization methods are all covered by MCQ's solution space under the same codebook size setting. However, finding the optimal solution to MCQ has been proven NP-hard due to its encoding process, i.e., converting an input vector to a binary code. To tackle this, researchers apply constraints to find near-optimal solutions or employ heuristic algorithms that are still time-consuming for encoding. Different from previous approaches, this paper makes the first attempt to find a deep solution to MCQ. The encoding network is designed to be as simple as possible, so the very complex encoding problem becomes a simple feed-forward pass. Compared with other methods on three datasets, our method shows state-of-the-art performance. Notably, our method is 11×-38× faster than heuristic algorithms for encoding, which makes it more practical for real-world large-scale retrieval. Our code is publicly available: https://github.com/DeepMCQ/DeepQ.
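For intuition, the sketch below shows the MCQ reconstruction described above: one codeword is picked from each sub-codebook and their sum approximates the input vector. The greedy residual encoder is only an illustration of the hard encoding step; the paper's contribution is replacing it with a learned feed-forward network, which is not reproduced here.

```python
import numpy as np

def greedy_mcq_encode(x, codebooks):
    """Pick one codeword per codebook greedily on the residual; return the chosen indices."""
    residual = x.copy()
    codes = []
    for cb in codebooks:                                 # cb has shape (K, d)
        idx = int(np.argmin(np.linalg.norm(residual - cb, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def mcq_decode(codes, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]   # M=4 sub-codebooks, K=256 each
x = rng.normal(size=64)
codes = greedy_mcq_encode(x, codebooks)                      # compact code: 4 indices
reconstruction_error = np.linalg.norm(x - mcq_decode(codes, codebooks))
```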

11.
IEEE Trans Image Process ; 32: 5017-5030, 2023.
Article in English | MEDLINE | ID: mdl-37186535

ABSTRACT

Lately, video-language pre-training and text-video retrieval have attracted significant attention with the explosion of multimedia data on the Internet. However, existing approaches for video-language pre-training typically make limited use of the hierarchical semantic information in videos, such as frame-level and global video-level semantics. In this work, we present an end-to-end pre-training network with Hierarchical Matching and Momentum Contrast, named HMMC. The key idea is to explore the hierarchical semantic information in videos via multilevel semantic matching between videos and texts. This design is motivated by the observation that if a video semantically matches a text (which can be a title, tag or caption), the frames in this video usually have semantic connections with the text and show higher similarity than frames in other videos. Hierarchical matching is mainly realized by two proxy tasks: Video-Text Matching (VTM) and Frame-Text Matching (FTM). A third proxy task, Frame Adjacency Matching (FAM), is proposed to enhance the single visual modality representations while training from scratch. Furthermore, a momentum contrast framework is introduced into HMMC to form a multimodal momentum contrast framework, enabling HMMC to incorporate more negative samples for contrastive learning, which contributes to the generalization of the representations. We also collected a large-scale Chinese video-language dataset (over 763k unique videos), named CHVTT, to explore the multilevel semantic connections between videos and texts. Experimental results on two major text-video retrieval benchmark datasets demonstrate the advantages of our method. We release our code at https://github.com/cheetah003/HMMC.
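As a rough sketch of the matching objective behind proxy tasks such as VTM, the snippet below computes a symmetric InfoNCE loss over cosine similarities between paired video and text embeddings in a batch. This is a generic contrastive formulation under assumed shapes, not the exact HMMC loss or its momentum-queue variant.

```python
import torch
import torch.nn.functional as F

def video_text_infonce(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D) paired embeddings where row i of each matches row i of the other."""
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    logits = v @ t.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> matching text
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> matching video
    return 0.5 * (loss_v2t + loss_t2v)

loss = video_text_infonce(torch.randn(8, 256), torch.randn(8, 256))
```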

12.
IEEE Trans Neural Netw Learn Syst ; 34(8): 5112-5121, 2023 Aug.
Article in English | MEDLINE | ID: mdl-34910639

ABSTRACT

Fine-grained visual classification (FGVC) is challenging due to the interclass similarity and intraclass variation in datasets. In this work, we explore the great merit of complex values in introducing an imaginary part for modeling data uncertainty (e.g., different points on the complex plane can describe the same state) and of graph convolutional networks (GCNs) in learning interdependencies among classes, to simultaneously tackle the above two major challenges. To this end, we propose a novel approach, termed text-assisted complex-valued fusion network (TA-CFN). Specifically, we expand each feature from 1-D real values to 2-D complex values by disassembling feature maps, thereby enabling the extension of traditional deep convolutional neural networks over the complex domain. Then, we fuse the real and imaginary parts of complex features through complex projection and a modulus operation. Finally, we build an undirected graph over the object labels with the assistance of a text corpus, and a GCN is learned to map this graph into a set of classifiers. The benefits are twofold: 1) complex features allow for a richer algebraic structure to better model the large variation within the same category, and 2) leveraging the interclass dependencies brought by the GCN captures key factors of the slight variation among different categories. We conduct extensive experiments to verify that our proposed model achieves state-of-the-art performance on two widely used FGVC datasets.
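The sketch below illustrates the core real-to-complex step described above: a real feature is split into real and imaginary parts and the two are fused by a modulus operation. The halving split is an illustrative assumption, not the paper's exact disassembly of feature maps.

```python
import torch

def complex_modulus_fusion(features):
    """features: (B, 2*D) real features; treat the two halves as real and imaginary parts."""
    real, imag = features.chunk(2, dim=1)
    z = torch.complex(real, imag)        # (B, D) complex-valued feature
    return torch.abs(z)                  # modulus fuses real and imaginary parts

fused = complex_modulus_fusion(torch.randn(4, 512))   # -> shape (4, 256)
```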

13.
IEEE Trans Image Process ; 32: 6485-6499, 2023.
Article in English | MEDLINE | ID: mdl-37991910

ABSTRACT

Existing supervised quantization methods usually learn the quantizers from pair-wise, triplet, or anchor-based losses, which only capture relationships locally without aligning them globally. This may cause inadequate use of the entire space and severe intersection among different semantics, leading to inferior retrieval performance. Furthermore, to enable quantizers to learn in an end-to-end way, current practices usually relax the non-differentiable quantization operation by substituting it with softmax, which unfortunately is biased, leading to an unsatisfying suboptimal solution. To address the above issues, we present Spherical Centralized Quantization (SCQ), which contains a Priori Knowledge based Feature (PKFA) module for the global alignment of feature vectors and an Annealing Regulation Semantic Quantization (ARSQ) module for low-biased optimization. Specifically, the PKFA module first applies Semantic Center Allocation (SCA) to obtain semantic centers based on prior knowledge, and then adopts Centralized Feature Alignment (CFA) to gather feature vectors around the corresponding semantic centers. SCA and CFA globally optimize inter-class separability and intra-class compactness, respectively. After that, the ARSQ module performs a partial-soft relaxation to tackle biases and applies an Annealing Regulation Quantization loss to further address the problem of local optima. Experimental results show that our SCQ outperforms state-of-the-art algorithms by a large margin (2.1%, 3.6%, and 5.5% mAP, respectively) on CIFAR-10, NUS-WIDE, and ImageNet with a code length of 8 bits. Code is publicly available: https://github.com/zzb111/Spherical-Centralized-Quantization.
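To illustrate the "gather features around semantic centers" idea, the sketch below pulls each feature toward a unit-norm class center via a cosine alignment loss. The random centers and the loss form are illustrative assumptions, not the paper's PKFA or ARSQ modules.

```python
import torch
import torch.nn.functional as F

def centralized_alignment_loss(features, labels, centers):
    """features: (B, D); labels: (B,) class ids; centers: (C, D) semantic centers on the sphere."""
    f = F.normalize(features, dim=1)
    c = F.normalize(centers, dim=1)
    cos = (f * c[labels]).sum(dim=1)     # cosine between each feature and its class center
    return (1.0 - cos).mean()            # minimizing this pulls features toward their centers

centers = F.normalize(torch.randn(10, 64), dim=1)     # stand-in for prior-knowledge-based centers
loss = centralized_alignment_loss(torch.randn(32, 64), torch.randint(0, 10, (32,)), centers)
```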

14.
Article in English | MEDLINE | ID: mdl-37163399

ABSTRACT

Building multi-person pose estimation (MPPE) models that can handle complex foregrounds and uncommon scenes is an important challenge in computer vision. Aside from designing novel models, strengthening training data is a promising direction but remains largely unexploited for the MPPE task. In this article, we systematically identify the key deficiencies of existing pose datasets that prevent the power of well-designed models from being fully exploited, and propose the corresponding solutions. Specifically, we find that traditional data augmentation techniques are inadequate in addressing two key deficiencies: imbalanced instance complexity (evaluated by our new metric, IC) and insufficient realistic scenes. To overcome these deficiencies, we propose a model-agnostic full-view data generation (Full-DG) method to enrich the training data from the perspectives of both poses and scenes. By hallucinating images with more balanced pose complexity and richer real-world scenes, Full-DG can help improve pose estimators' robustness and generalizability. In addition, we introduce a plug-and-play adaptive category-aware loss (AC-loss) to alleviate the severe pixel-level imbalance between keypoints and backgrounds (i.e., around 1:600). Full-DG together with AC-loss can be readily applied to both bottom-up and top-down models to improve their accuracy. Notably, when plugged into the representative estimators HigherHRNet and HRNet, our method achieves substantial performance gains of 1.0%-2.9% AP on the COCO benchmark and 1.0%-5.1% AP on the CrowdPose benchmark.
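As a minimal illustration of countering the roughly 1:600 keypoint/background pixel imbalance, the sketch below up-weights keypoint pixels in a heatmap regression loss. The fixed positive weight is an assumption for illustration only; the paper's AC-loss adapts the weighting per category.

```python
import torch
import torch.nn.functional as F

def weighted_heatmap_loss(pred, target, pos_weight=50.0):
    """pred, target: (B, K, H, W) keypoint heatmaps with values in [0, 1]."""
    weights = torch.ones_like(target)
    weights[target > 0.5] = pos_weight               # up-weight the rare keypoint pixels
    return (weights * F.mse_loss(pred, target, reduction="none")).mean()

pred = torch.rand(2, 17, 64, 48)                     # 17 keypoints, illustrative heatmap size
target = torch.zeros_like(pred)
target[:, :, 32, 24] = 1.0                           # one keypoint peak per channel
loss = weighted_heatmap_loss(pred, target)
```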

15.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3311-3328, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35763471

ABSTRACT

Generating photo-realistic images from labels (e.g., semantic labels or sketch labels) is much more challenging than the general image-to-image translation task, mainly due to the large differences between extremely sparse labels and detail-rich images. We propose a general framework, Lab2Pix, to tackle this issue from two aspects: 1) how to extract useful information from the input; and 2) how to efficiently bridge the gap between the labels and images. Specifically, we propose a Double-Guided Normalization (DG-Norm) that uses the input label to semantically guide activations in normalization layers, and uses global features with large receptive fields to differentiate activations within the same semantic region. To generate images efficiently, we further propose Label Guided Spatial Co-Attention (LSCA) to encourage the learning of incremental visual information using limited model parameters while storing the well-synthesized parts in lower-level features. Accordingly, Hierarchical Perceptual Discriminators with Foreground Enhancement Masks are proposed to challenge the generator rigorously, thus encouraging realistic image generation, and a sharp enhancement loss is further introduced for high-quality, sharp image generation. We instantiate Lab2Pix for the label-to-image task in both unpaired (Lab2Pix-V1) and paired (Lab2Pix-V2) settings. Extensive experiments conducted on various datasets demonstrate that our method significantly outperforms state-of-the-art methods both quantitatively and qualitatively in both settings.

16.
IEEE Trans Cybern ; 53(11): 7263-7274, 2023 Nov.
Article in English | MEDLINE | ID: mdl-36251898

ABSTRACT

Part-level attribute parsing is a fundamental but challenging task, which requires region-level visual understanding to provide explainable details of body parts. Most existing approaches address this problem by adding a regional convolutional neural network (RCNN) with an attribute prediction head to a two-stage detector, in which attributes of body parts are identified from local part boxes. However, local part boxes with limited visual cues (i.e., part appearance only) lead to unsatisfactory parsing results, since attributes of body parts are highly dependent on comprehensive relations among them. In this article, we propose a knowledge-embedded RCNN (KE-RCNN) to identify attributes by leveraging rich knowledge, including implicit knowledge (e.g., the attribute "above-the-hip" for a shirt requires visual/geometry relations of shirt-hip) and explicit knowledge (e.g., the part "shorts" cannot have the attribute "hoodie" or "lining"). Specifically, the KE-RCNN consists of two novel components: 1) an implicit knowledge-based encoder (IK-En) and 2) an explicit knowledge-based decoder (EK-De). The former is designed to enhance part-level representations by encoding part-part relational contexts into part boxes, and the latter is proposed to decode attributes with the guidance of prior knowledge about part-attribute relations. In this way, the KE-RCNN is plug-and-play and can be integrated into any two-stage detector, for example, Attribute-RCNN, Cascade-RCNN, HRNet-based RCNN, and SwinTransformer-based RCNN. Extensive experiments conducted on two challenging benchmarks, Fashionpedia and Kinetics-TPS, demonstrate the effectiveness and generalizability of the KE-RCNN. In particular, it achieves improvements over all existing methods, reaching gains of around 3% in AP_all^{IoU+F1} on Fashionpedia and around 4% in Acc_p on Kinetics-TPS. Code and models are publicly available at: https://github.com/sota-joson/KE-RCNN.

17.
IEEE Trans Image Process ; 31: 5936-5948, 2022.
Article in English | MEDLINE | ID: mdl-36083958

ABSTRACT

Video Question Answering (VideoQA), which explores the spatial-temporal visual information of videos given a linguistic query, has received unprecedented attention in recent years. One of the main challenges lies in locating relevant visual and linguistic information, and therefore various attention-based approaches have been proposed. Despite the impressive progress, two aspects that matter for proper attention are not fully explored by current methods. First, prior knowledge, which plays an important role in the human cognitive process by assisting the reasoning process of VideoQA, is not fully utilized. Second, structured visual information (e.g., objects), rather than the raw video, is underestimated. To address these two issues, we propose Prior Knowledge and Object-sensitive Learning (PKOL), which explores the effect of prior knowledge and learns object-sensitive representations to boost the VideoQA task. Specifically, we first propose a Prior Knowledge Exploring (PKE) module that aims to acquire and integrate prior knowledge into the question feature for feature enrichment, where an information retriever is constructed to retrieve related sentences as prior knowledge from a massive corpus. In addition, we propose an Object-sensitive Representation Learning (ORL) module to generate object-sensitive features by interacting object-level features with frame- and clip-level features. Our proposed PKOL achieves consistent improvements on three competitive benchmarks (i.e., MSVD-QA, MSRVTT-QA, and TGIF-QA) and attains state-of-the-art performance. The source code is available at https://github.com/zchoi/PKOL.
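The "information retriever" component can be pictured with the simple TF-IDF sketch below, which fetches the corpus sentences most similar to a question and treats them as prior knowledge. The toy corpus and TF-IDF scoring are illustrative stand-ins for the paper's retriever over a massive corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "a man is riding a skateboard down the street",
    "a group of people are playing basketball",
    "a dog catches a frisbee in the park",
]
question = "what is the man riding"

# Fit TF-IDF on corpus plus question so both share one vocabulary, then rank by cosine similarity.
vec = TfidfVectorizer().fit(corpus + [question])
sims = cosine_similarity(vec.transform([question]), vec.transform(corpus))[0]
top_k = sims.argsort()[::-1][:2]                      # indices of the most related sentences
prior_knowledge = [corpus[i] for i in top_k]          # fed alongside the question feature
```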

18.
IEEE Trans Image Process ; 31: 6694-6706, 2022.
Article in English | MEDLINE | ID: mdl-36219662

ABSTRACT

Referring Expression Comprehension (REC) aims to localize the image region of a given object described by a natural-language expression. While promising performance has been demonstrated, existing REC algorithms make the strong assumption that training data are given upfront, which degrades their practicality in real-world scenarios. In this paper, we propose Continual Referring Expression Comprehension (CREC), a new setting for REC in which a model learns on a stream of incoming tasks. In order to continuously improve the model on sequential tasks without forgetting previously learned knowledge and without repeatedly re-training from scratch, we propose an effective baseline method named Dual Modular Memorization (DMM), which alleviates the problem of catastrophic forgetting with two memorization modules: Implicit-Memory and Explicit-Memory. Specifically, the former module constrains drastic changes to important parameters learned on old tasks when learning a new task, while the latter maintains a buffer pool to dynamically select and store representative samples of each seen task for future rehearsal. We create three benchmarks for the new CREC setting by re-splitting three widely used REC datasets, RefCOCO, RefCOCO+ and RefCOCOg, into sequential tasks. Extensive experiments on the constructed benchmarks demonstrate that our DMM method significantly outperforms other alternatives, based on two popular REC backbones. We make the source code and benchmarks publicly available to foster future progress in this field: https://github.com/zackschen/DMM.
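The Explicit-Memory idea of keeping representative samples of seen tasks for rehearsal can be sketched with a simple reservoir-sampling buffer, as below. The selection rule and the placeholder samples are illustrative assumptions, not the paper's dynamic selection strategy.

```python
import random

class ReplayBuffer:
    """Fixed-size buffer that keeps a uniform sample of everything seen so far (reservoir sampling)."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.storage = []
        self.seen = 0

    def add(self, sample):
        self.seen += 1
        if len(self.storage) < self.capacity:
            self.storage.append(sample)
        else:
            j = random.randrange(self.seen)       # replace an existing slot with decreasing probability
            if j < self.capacity:
                self.storage[j] = sample

    def sample(self, batch_size: int):
        return random.sample(self.storage, min(batch_size, len(self.storage)))

buffer = ReplayBuffer(capacity=200)
for step in range(1000):
    buffer.add(("image_region", "expression", step))   # placeholder (region, expression) pair
rehearsal_batch = buffer.sample(16)                    # mixed into training on the new task
```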


Subjects
Comprehension; Learning; Algorithms; Benchmarking
19.
IEEE Trans Cybern ; 52(7): 5961-5972, 2022 Jul.
Article in English | MEDLINE | ID: mdl-33710964

ABSTRACT

Scene graph generation (SGG) is built on top of detected objects to predict pairwise visual relations between objects, providing an abstraction of the image content. Existing works have revealed that if the links between objects are given as prior knowledge, the performance of SGG is significantly improved. Inspired by this observation, in this article we propose a relation regularized network (R2-Net), which can predict whether there is a relationship between two objects and encode this relation to refine object features and improve SGG. Specifically, we first construct an affinity matrix among detected objects to represent the probability of a relationship between two objects. Graph convolution networks (GCNs) over this relation affinity matrix are then used as object encoders, producing relation-regularized representations of objects. With these relation-regularized features, our R2-Net can effectively refine object labels and generate scene graphs. Extensive experiments are conducted on the Visual Genome dataset for three SGG tasks (i.e., predicate classification, scene graph classification, and scene graph detection), demonstrating the effectiveness of our proposed method. Ablation studies also verify the key roles of our proposed components in the performance improvement.
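The sketch below shows the general mechanism of refining object features with a graph convolution over a relation affinity matrix. The affinity construction (sigmoid of feature inner products), layer sizes, and normalization are illustrative assumptions rather than the exact R2-Net design.

```python
import torch
import torch.nn as nn

class AffinityGCNLayer(nn.Module):
    """One GCN layer whose adjacency is a soft relation affinity matrix between detected objects."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, feats, affinity):
        """feats: (N, in_dim) object features; affinity: (N, N) relation probabilities."""
        a = affinity + torch.eye(affinity.size(0))             # add self-loops
        d_inv_sqrt = a.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        a_norm = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :] # symmetric normalization
        return torch.relu(self.linear(a_norm @ feats))          # relation-regularized features

feats = torch.randn(6, 256)                                     # 6 detected objects
affinity = torch.sigmoid(feats @ feats.t())                     # stand-in pairwise relation scores
refined = AffinityGCNLayer(256, 256)(feats, affinity)
```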

20.
IEEE Trans Image Process ; 31: 202-215, 2022.
Article in English | MEDLINE | ID: mdl-34710043

ABSTRACT

Recently, integrating vision and language for in-depth video understanding, e.g., video captioning and video question answering, has become a promising direction in artificial intelligence. However, due to the complexity of video information, it is challenging to extract a video feature that can well represent multiple levels of concepts, i.e., objects, actions, and events. Meanwhile, content completeness and syntactic consistency play an important role in high-quality language-related video understanding. Motivated by these observations, we propose a novel framework, named Hierarchical Representation Network with Auxiliary Tasks (HRNAT), for learning multi-level representations and obtaining syntax-aware video captions. Specifically, the Cross-modality Matching Task enables the learning of hierarchical representations of videos, guided by the three-level representation of languages. The Syntax-guiding Task and the Vision-assist Task contribute to generating descriptions that are not only globally similar to the video content but also syntax-consistent with the ground-truth description. The key components of our model are general and can be readily applied to both video captioning and video question answering tasks. Performance on the above tasks across several benchmark datasets validates the effectiveness and superiority of our proposed method compared with state-of-the-art methods. Code and models are released at https://github.com/riesling00/HRNAT.
