Results 1 - 20 of 40
1.
IEEE Trans Image Process ; 33: 2558-2571, 2024.
Article in English | MEDLINE | ID: mdl-38530729

ABSTRACT

Despite remarkable successes in unimodal learning tasks, backdoor attacks against cross-modal learning remain underexplored due to limited generalization and inferior stealthiness when multiple modalities are involved. Notably, since works in this area mainly inherit ideas from unimodal visual attacks, they struggle to handle diverse cross-modal attack circumstances and to craft imperceptible trigger samples, which hinders their practicability in real-world applications. In this paper, we introduce a novel bilateral backdoor to fill in the missing pieces of the puzzle in cross-modal backdoor research and propose a generalized invisible backdoor framework against cross-modal learning (BadCM). Specifically, a cross-modal mining scheme is developed to capture modality-invariant components as target poisoning areas, where well-designed trigger patterns injected into these regions can be efficiently recognized by the victim models. This strategy is adapted to different image-text cross-modal models, making our framework applicable to various attack scenarios. Furthermore, to generate poisoned samples of high stealthiness, we conceive modality-specific generators for the visual and linguistic modalities that hide explicit trigger patterns in modality-invariant regions. To the best of our knowledge, BadCM is the first invisible backdoor method deliberately designed for diverse cross-modal attacks within one unified framework. Comprehensive experimental evaluations on two typical applications, i.e., cross-modal retrieval and VQA, demonstrate the effectiveness and generalization of our method under multiple kinds of attack scenarios. Moreover, we show that BadCM can robustly evade existing backdoor defenses. Our code is available at https://github.com/xandery-geek/BadCM.
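As a rough illustration of what region-constrained trigger injection can look like, a minimal PyTorch sketch is given below; the mask, trigger, and amplitude bound are hypothetical stand-ins for the components BadCM learns with its mining scheme and generators, not the authors' implementation.

```python
# Illustrative sketch only (not the BadCM code): inject a low-amplitude trigger
# pattern into a modality-invariant region given by a binary mask.
import torch

def inject_visual_trigger(image: torch.Tensor,
                          region_mask: torch.Tensor,
                          trigger: torch.Tensor,
                          epsilon: float = 8.0 / 255.0) -> torch.Tensor:
    """image: (C, H, W) in [0, 1]; region_mask: (1, H, W) in {0, 1};
    trigger: (C, H, W) perturbation pattern."""
    perturbation = epsilon * torch.tanh(trigger)      # bound the amplitude
    poisoned = image + region_mask * perturbation     # only touch invariant regions
    return poisoned.clamp(0.0, 1.0)

image = torch.rand(3, 224, 224)
mask = (torch.rand(1, 224, 224) > 0.7).float()        # placeholder region mask
trigger = torch.randn(3, 224, 224)
poisoned = inject_visual_trigger(image, mask, trigger)
```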

2.
IEEE Trans Image Process ; 33: 1136-1148, 2024.
Article in English | MEDLINE | ID: mdl-38300774

ABSTRACT

The image-level label has prevailed in weakly supervised semantic segmentation tasks due to its easy availability. Since image-level labels can only indicate the existence or absence of specific categories of objects, visualization-based techniques have been widely adopted to provide object location clues. Considering that class activation maps (CAMs) can only locate the most discriminative part of objects, recent approaches usually adopt an expansion strategy to enlarge the activation area for more integral object localization. However, without proper constraints, the expanded activation easily intrudes into the background region. In this paper, we propose spatial structure constraints (SSC) for weakly supervised semantic segmentation to alleviate the unwanted object over-activation caused by attention expansion. Specifically, we propose a CAM-driven reconstruction module to directly reconstruct the input image from deep CAM features, which constrains the diffusion of last-layer object attention by preserving the coarse spatial structure of the image content. Moreover, we propose an activation self-modulation module to refine CAMs with finer spatial structure details by enhancing regional consistency. Without external saliency models to provide background clues, our approach achieves 72.7% and 47.0% mIoU on the PASCAL VOC 2012 and COCO datasets, respectively, demonstrating the superiority of our proposed approach. The source codes and models have been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/SSC.
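A minimal sketch of the general idea behind a CAM-driven reconstruction constraint follows, assuming a toy decoder and an L1 objective; it is illustrative only and not the SSC module itself.

```python
# Sketch: decode the input image back from CAM-like features so the activations
# are forced to keep the coarse spatial structure of the image (hypothetical module).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CamReconstructionHead(nn.Module):
    def __init__(self, num_classes: int = 20, hidden: int = 64):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(num_classes, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 3, 3, padding=1),
        )

    def forward(self, cams: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # cams: (B, num_classes, h, w); image: (B, 3, H, W)
        recon = self.decoder(cams)
        recon = F.interpolate(recon, size=image.shape[-2:], mode="bilinear",
                              align_corners=False)
        return F.l1_loss(recon, image)   # reconstruction penalizes over-activation

cams = torch.rand(2, 20, 32, 32)
image = torch.rand(2, 3, 256, 256)
loss = CamReconstructionHead()(cams, image)
```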

3.
Article in English | MEDLINE | ID: mdl-38376967

ABSTRACT

Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various deepfake detection and textual fake news detection methods have been proposed, they are only designed for single-modality forgery based on binary classification, let alone analyzing and reasoning about subtle forgery traces across different modalities. In this paper, we highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4). DGM4 aims not only to detect the authenticity of multi-modal media, but also to ground the manipulated content, which requires deeper reasoning about multi-modal media manipulation. To support a large-scale investigation, we construct the first DGM4 dataset. Moreover, we propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities. HAMMER performs 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning, and 2) modality-aware cross-attention by a multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation detection and grounding heads are integrated from shallow to deep levels based on the interacted multi-modal information. To exploit more fine-grained contrastive learning for cross-modal semantic alignment, we further integrate a Manipulation-Aware Contrastive Loss with Local View and construct a more advanced model, HAMMER++. Finally, we build an extensive benchmark and set up rigorous evaluation metrics for this new research problem. Comprehensive experiments demonstrate the superiority of HAMMER and HAMMER++; several valuable observations are also revealed to facilitate future research in multi-modal media manipulation.
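For context, a generic InfoNCE-style cross-modal contrastive loss of the kind used for image-text alignment in shallow manipulation reasoning might look as follows; this is a standard formulation, not the HAMMER loss itself.

```python
# Generic symmetric image-text contrastive loss (illustrative, not HAMMER's).
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb: torch.Tensor,
                                 txt_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # symmetric objective: image-to-text and text-to-image
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```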

4.
IEEE Trans Pattern Anal Mach Intell ; 46(7): 4850-4865, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38261483

ABSTRACT

Although stereo image restoration has been extensively studied, most existing work focuses on restoring stereo images with limited horizontal parallax due to the binocular symmetry constraint. Stereo images with unlimited parallax (e.g., large ranges and asymmetrical types) are more challenging in real-world applications and have rarely been explored so far. To restore high-quality stereo images with unlimited parallax, this paper proposes an attention-guided correspondence learning method, which learns both self-view and cross-view feature correspondence guided by parallax and omnidirectional attention. To learn cross-view feature correspondence, a Selective Parallax Attention Module (SPAM) is proposed to let cross-view features interact under the guidance of parallax attention that adaptively selects receptive fields for different parallax ranges. Furthermore, to handle asymmetrical parallax, we propose a Non-local Omnidirectional Attention Module (NOAM) to learn the non-local correlation of both self- and cross-view contexts, which guides the aggregation of global contextual features. Finally, we propose an Attention-guided Correspondence Learning Restoration Network (ACLRNet) built upon SPAMs and NOAMs to restore stereo images by associating the features of the two views based on the learned correspondence. Extensive experiments on five benchmark datasets demonstrate the effectiveness and generalization of the proposed method on three stereo image restoration tasks, including super-resolution, denoising, and compression artifact reduction.

5.
IEEE Trans Pattern Anal Mach Intell ; 46(6): 4174-4187, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38236680

ABSTRACT

The query-oriented micro-video summarization task aims to generate a concise sentence with two properties: (a) summarizing the main semantics of the micro-video and (b) being expressed in the form of search queries to facilitate retrieval. Despite its enormous application value in the retrieval area, this direction has barely been explored. Previous studies of summarization mostly focus on content summarization for traditional long videos. Directly applying these studies tends to yield unsatisfactory results because of the unique features of micro-videos and queries: diverse entities and complex scenes within a short time, semantic gaps between modalities, and various queries in distinct expressions. To specifically adapt to these characteristics, we propose a query-oriented micro-video summarization model, dubbed QMS. It employs an encoder-decoder-based transformer architecture as the skeleton. The multi-modal (visual and textual) signals are passed through two modality-specific encoders to obtain their representations, followed by an entity-aware representation learning module to identify and highlight critical entity information. As to the optimization, given the large semantic gaps between modalities, we assign the modalities different confidence scores according to their semantic relevance in the optimization process. Additionally, we develop a novel strategy to sample the effective target query from the diverse query set with various expressions. Extensive experiments demonstrate the superiority of the QMS scheme, on both the summarization and retrieval tasks, over several state-of-the-art methods.

6.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 3665-3678, 2024 May.
Article in English | MEDLINE | ID: mdl-38145530

ABSTRACT

The composed image retrieval (CIR) task aims to retrieve the desired target image for a given multimodal query, i.e., a reference image with its corresponding modification text. Existing efforts suffer from two key limitations: 1) ignoring the multiple query-target matching factors; and 2) ignoring the potential unlabeled reference-target image pairs in existing benchmark datasets. Addressing these two limitations is non-trivial due to the following challenges: 1) how to effectively model the multiple matching factors in a latent way without direct supervision signals; and 2) how to fully utilize the potential unlabeled reference-target image pairs to improve the generalization ability of the CIR model. To address these challenges, in this work, we first propose a CLIP-Transformer based muLtI-factor Matching Network (LIMN), which consists of three key modules: disentanglement-based latent factor tokens mining, dual aggregation-based matching token learning, and dual query-target matching modeling. Thereafter, we design an iterative dual self-training paradigm to further enhance the performance of LIMN by fully utilizing the potential unlabeled reference-target image pairs in a weakly supervised manner. We denote the LIMN enhanced by this iterative dual self-training paradigm as LIMN+. Extensive experiments on four datasets, including FashionIQ, Shoes, CIRR, and Fashion200K, show that our proposed LIMN and LIMN+ significantly surpass the state-of-the-art baselines.
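A minimal sketch of how unlabeled reference-target pairs could be mined as pseudo-labels in a self-training round is shown below; the thresholding rule and names are assumptions for illustration, not the LIMN+ training code.

```python
# Illustrative pseudo-labelling step for a self-training loop.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mine_pseudo_pairs(ref_embs: torch.Tensor,
                      tgt_embs: torch.Tensor,
                      threshold: float = 0.8):
    """Return (ref_idx, tgt_idx) pairs whose cosine similarity exceeds a
    confidence threshold, to be reused as weak supervision."""
    ref = F.normalize(ref_embs, dim=-1)
    tgt = F.normalize(tgt_embs, dim=-1)
    sim = ref @ tgt.t()                         # (N_ref, N_tgt)
    scores, best = sim.max(dim=1)               # most similar target per reference
    keep = scores > threshold
    return torch.nonzero(keep, as_tuple=False).squeeze(1), best[keep]

ref_idx, tgt_idx = mine_pseudo_pairs(torch.randn(100, 512), torch.randn(100, 512))
```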

7.
IEEE Trans Image Process ; 32: 5794-5807, 2023.
Article in English | MEDLINE | ID: mdl-37843991

ABSTRACT

Talking face generation is the process of synthesizing a lip-synchronized video given a reference portrait and an audio clip. However, generating a fine-grained talking video is non-trivial due to several challenges: 1) capturing vivid facial expressions, such as muscle movements; 2) ensuring smooth transitions between consecutive frames; and 3) preserving the details of the reference portrait. Existing efforts have only focused on modeling rigid lip movements, resulting in low-fidelity videos with jerky facial muscle deformations. To address these challenges, we propose a novel Fine-gRained mOtioN moDel (FROND), consisting of three components. In the first component, we adopt a two-stream encoder to capture local facial movement keypoints and embed their overall motion context as a global code. In the second component, we design a motion estimation module to predict audio-driven movements. This enables the learning of local keypoint motion in the continuous trajectory space to achieve smooth temporal facial movements. Additionally, the local and global motions are fused to estimate a continuous dense motion field, resulting in spatially smooth movements. In the third component, we devise a novel implicit image decoder based on an implicit neural network. This decoder recovers high-frequency information from the input image, resulting in a high-fidelity talking face. In summary, FROND refines the motion trajectories of facial keypoints into a continuous dense motion field, which is followed by a decoder that fully exploits the inherent smoothness of the motion. We conduct quantitative and qualitative model evaluations on benchmark datasets. The experimental results show that our proposed FROND significantly outperforms several state-of-the-art baselines.
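To give a sense of what an implicit image decoder amounts to in general, here is a bare-bones coordinate-MLP sketch; it is an assumption-laden simplification, not the FROND decoder.

```python
# Coordinate-based MLP: query RGB values at continuous pixel coordinates
# conditioned on sampled local features (generic illustration).
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    def __init__(self, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 3),                 # RGB at each queried coordinate
        )

    def forward(self, coords: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # coords: (N, 2) in [-1, 1]; feats: (N, feat_dim) sampled local features
        return torch.sigmoid(self.mlp(torch.cat([coords, feats], dim=-1)))

decoder = ImplicitDecoder()
rgb = decoder(torch.rand(1024, 2) * 2 - 1, torch.randn(1024, 64))
```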

8.
IEEE Trans Pattern Anal Mach Intell ; 45(12): 14144-14160, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37669202

ABSTRACT

Partial person re-identification (ReID) aims to solve the problem of image spatial misalignment caused by occlusions or out-of-view regions. Despite significant progress through the introduction of additional information, such as human pose landmarks, mask maps, and spatial information, partial person ReID remains challenging due to noisy keypoints and fragile pedestrian representations. To address these issues, we propose a unified attribute-guided collaborative learning scheme for partial person ReID. Specifically, we introduce an adaptive threshold-guided masked graph convolutional network that can dynamically remove untrustworthy edges to suppress the diffusion of noisy keypoints. Furthermore, we incorporate human attributes and devise a cyclic heterogeneous graph convolutional network to effectively fuse cross-modal pedestrian information through intra- and inter-graph interaction, resulting in robust pedestrian representations. Finally, to enhance keypoint representation learning, we design a novel part-based similarity constraint based on the axisymmetric characteristic of the human body. Extensive experiments on multiple public datasets show that our model achieves superior performance compared to other state-of-the-art baselines.
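A toy version of threshold-guided edge masking before graph convolution is sketched below, with an assumed fixed threshold rather than the adaptive one described above; it is not the paper's network.

```python
# Drop low-confidence edges before aggregating keypoint features on a graph.
import torch
import torch.nn as nn

class MaskedGraphConv(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, threshold: float = 0.3):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.threshold = threshold

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) keypoint features; adj: (N, N) edge confidences in [0, 1]
        mask = (adj >= self.threshold).float()        # remove untrustworthy edges
        adj = adj * mask
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # row-normalize
        return torch.relu(self.linear(adj @ x))

layer = MaskedGraphConv(128, 128)
out = layer(torch.randn(17, 128), torch.rand(17, 17))
```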

9.
IEEE Trans Image Process ; 32: 5537-5549, 2023.
Article in English | MEDLINE | ID: mdl-37773902

ABSTRACT

Visual Question Answering (VQA) is fundamentally compositional in nature, and many questions can be answered simply by decomposing them into modular sub-problems. The recently proposed Neural Module Network (NMN) applies this strategy to question answering, yet relies heavily on off-the-shelf layout parsers or additional expert policies for network architecture design instead of learning from the data. These strategies result in unsatisfactory adaptability to the semantically complicated variance of the inputs, thereby hindering the representational capacity and generalizability of the model. To tackle this problem, we propose a Semantic-aware modUlar caPsulE Routing framework, termed SUPER, to better capture instance-specific vision-semantic characteristics and refine the discriminative representations for prediction. In particular, five powerful specialized modules as well as dynamic routers are tailored in each layer of the SUPER network, and compact routing spaces are constructed such that a variety of customizable routes can be sufficiently exploited and the vision-semantic representations can be explicitly calibrated. We comparatively justify the effectiveness and generalization ability of our proposed SUPER scheme over five benchmark datasets, as well as its parameter-efficiency advantage. It is worth emphasizing that this work does not pursue state-of-the-art results in VQA. Instead, we expect our model to provide a novel perspective on architecture learning and representation calibration for VQA.

10.
IEEE Trans Image Process ; 32: 3836-3846, 2023.
Article in English | MEDLINE | ID: mdl-37410654

ABSTRACT

Visual Commonsense Reasoning (VCR), deemed a challenging extension of Visual Question Answering (VQA), endeavors to pursue a higher-level visual comprehension. VCR includes two complementary processes: question answering over a given image and rationale inference for explaining the answer. Over the years, a variety of VCR methods have pushed forward the state of the art on the benchmark dataset. Despite the significance of these methods, they often treat the two processes separately and hence decompose VCR into two unrelated VQA instances. As a result, the pivotal connection between question answering and rationale inference is broken, rendering existing efforts less faithful to visual reasoning. To empirically study this issue, we perform in-depth explorations in terms of both language shortcuts and generalization capability. Based on our findings, we then propose a plug-and-play knowledge distillation enhanced framework to couple the question answering and rationale inference processes. The key contribution lies in the introduction of a new branch, which serves as a relay to bridge the two processes. Given that our framework is model-agnostic, we apply it to existing popular baselines and validate its effectiveness on the benchmark dataset. As demonstrated in the experimental results, when equipped with our method, these baselines all achieve consistent and significant performance improvements, evidently verifying the viability of coupling the processes.
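As background, the standard temperature-scaled knowledge distillation objective that such a framework builds on can be sketched as follows; this is the generic formulation, and the paper's bridging branch and coupling design are not reproduced here.

```python
# Classic soft-label distillation between a teacher and a student head.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 4.0) -> torch.Tensor:
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # scale by T^2 so gradients keep a comparable magnitude across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

loss = distillation_loss(torch.randn(16, 4), torch.randn(16, 4))
```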

11.
Article in English | MEDLINE | ID: mdl-37216233

ABSTRACT

The goal of talking face generation is to synthesize a sequence of face images of a specified identity, ensuring that the mouth movements are synchronized with the given audio. Recently, image-based talking face generation has emerged as a popular approach. It can generate talking face images synchronized with the audio from merely a facial image of arbitrary identity and an audio clip. Despite the accessible input, it forgoes the exploitation of the audio emotion, causing the generated faces to suffer from emotion unsynchronization, mouth inaccuracy, and image quality deficiency. In this article, we build a two-stage audio emotion-aware talking face generation (AMIGO) framework to generate high-quality talking face videos with cross-modally synced emotion. Specifically, in stage one, we propose a sequence-to-sequence (seq2seq) cross-modal emotional landmark generation network to generate vivid landmarks whose lip movements and emotion are both synchronized with the input audio. Meanwhile, we utilize a coordinated visual emotion representation to improve the extraction of its audio counterpart. In stage two, a feature-adaptive visual translation network is designed to translate the synthesized landmarks into facial images. Concretely, we propose a feature-adaptive transformation module to fuse the high-level representations of landmarks and images, resulting in significant improvement in image quality. We perform extensive experiments on the multi-view emotional audio-visual dataset (MEAD) and crowd-sourced emotional multimodal actors dataset (CREMA-D) benchmarks, demonstrating that our model outperforms state-of-the-art baselines.

12.
IEEE Trans Image Process ; 32: 2215-2227, 2023.
Article in English | MEDLINE | ID: mdl-37040248

ABSTRACT

Semi-supervised learning has been well established in the area of image classification but remains underexplored in video-based action recognition. FixMatch is a state-of-the-art semi-supervised method for image classification, but it does not work well when transferred directly to the video domain, since it only utilizes the single RGB modality, which contains insufficient motion information. Moreover, it only leverages highly confident pseudo-labels to explore consistency between strongly-augmented and weakly-augmented samples, resulting in limited supervised signals, long training time, and insufficient feature discriminability. To address the above issues, we propose neighbor-guided consistent and contrastive learning (NCCL), which takes both RGB and temporal gradient (TG) as input and is based on the teacher-student framework. Given the limited number of labelled samples, we first incorporate neighbor information as a self-supervised signal to explore the consistency property, which compensates for the lack of supervised signals and the long training time of FixMatch. To learn more discriminative feature representations, we further propose a novel neighbor-guided category-level contrastive learning term to minimize the intra-class distance and enlarge the inter-class distance. We conduct extensive experiments on four datasets to validate the effectiveness of our method. Compared with state-of-the-art methods, our proposed NCCL achieves superior performance with a much lower computational cost.
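A generic category-level contrastive term driven by (pseudo-)labels is sketched below, only as a stand-in for the neighbor-guided loss described above; it is not the NCCL objective itself.

```python
# Supervised-contrastive-style loss: pull samples sharing a (pseudo-)label
# together and push other samples apart.
import torch
import torch.nn.functional as F

def category_contrastive_loss(features: torch.Tensor,
                              pseudo_labels: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    feats = F.normalize(features, dim=-1)
    sim = feats @ feats.t() / temperature                        # (B, B)
    same = (pseudo_labels[:, None] == pseudo_labels[None, :]).float()
    eye = torch.eye(len(feats), device=feats.device)
    same = same - eye                                            # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim - eye * 1e9, dim=1, keepdim=True)
    pos_count = same.sum(dim=1).clamp(min=1)
    return -(same * log_prob).sum(dim=1).div(pos_count).mean()

loss = category_contrastive_loss(torch.randn(8, 128), torch.randint(0, 3, (8,)))
```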

13.
IEEE Trans Neural Netw Learn Syst ; 34(12): 10039-10050, 2023 Dec.
Article in English | MEDLINE | ID: mdl-35427224

ABSTRACT

Review-involved recommender systems, which use review information to enhance recommendation, have received increasing interest over the past years. Among them, one advanced branch extracts salient aspects from textual reviews (i.e., the item attributes that users express) and combines them with the matrix factorization (MF) technique. However, existing approaches all ignore the fact that semantically different reviews often include opposite aspect information. In particular, positive reviews usually express aspects that users prefer, while negative ones describe aspects that users dislike. As a result, this may mislead recommender systems into making incorrect decisions when modeling user preferences. Toward this end, in this article, we present a review polarity-wise recommender model, dubbed RPR, to treat reviews with different polarities discriminately. To be specific, in this model, positive and negative reviews are separately gathered and used to model the user-preferred and user-rejected aspects, respectively. Besides, to overcome the imbalance between semantically different reviews, we further develop an aspect-aware importance weighting strategy to align the aspect importance for these two kinds of reviews. Extensive experiments conducted on eight benchmark datasets have demonstrated the superiority of our model when compared with several state-of-the-art review-involved baselines. Moreover, our method can provide certain explanations for real-world rating prediction scenarios.
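A schematic interpretation of polarity-separated aspect modeling on top of matrix factorization is sketched below; the toy model, inputs, and score composition are assumptions for illustration, not the RPR architecture.

```python
# Toy rating predictor: preferred aspects add to the MF score, rejected aspects subtract.
import torch
import torch.nn as nn

class PolarityAwareMF(nn.Module):
    def __init__(self, n_users: int, n_items: int, n_aspects: int, dim: int = 32):
        super().__init__()
        self.user = nn.Embedding(n_users, dim)
        self.item = nn.Embedding(n_items, dim)
        self.aspect = nn.Embedding(n_aspects, dim)

    def forward(self, u, i, pos_aspects, neg_aspects):
        # pos_aspects / neg_aspects: (B, A) multi-hot aspect indicators
        pu, qi = self.user(u), self.item(i)
        pos = pos_aspects @ self.aspect.weight        # preferred-aspect summary
        neg = neg_aspects @ self.aspect.weight        # rejected-aspect summary
        return (pu * qi).sum(-1) + (pu * pos).sum(-1) - (pu * neg).sum(-1)

model = PolarityAwareMF(100, 200, 50)
score = model(torch.tensor([3]), torch.tensor([7]),
              (torch.rand(1, 50) > 0.8).float(), (torch.rand(1, 50) > 0.8).float())
```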

14.
IEEE Trans Image Process ; 31: 4733-4745, 2022.
Article in English | MEDLINE | ID: mdl-35793293

ABSTRACT

Fashion Compatibility Modeling (FCM), which aims to automatically evaluate whether a given set of fashion items makes a compatible outfit, has attracted increasing research attention. Recent studies have demonstrated the benefits of conducting item representation disentanglement for FCM. Although these efforts have achieved prominent progress, they still perform unsatisfactorily, as they mainly investigate the visual content of fashion items while overlooking the semantic attributes of items (e.g., color and pattern), which could largely boost model performance and interpretability. To address this issue, we propose to comprehensively explore the visual content and attributes of fashion items for FCM. This problem is non-trivial considering the following challenges: a) how to utilize the irregular attribute labels of items to partially supervise the attribute-level representation learning of fashion items; b) how to ensure the intact disentanglement of attribute-level representations; and c) how to effectively stitch together the multiple granularities (i.e., coarse-grained item-level and fine-grained attribute-level) of information to enable performance improvement and interpretability. To address these challenges, in this work, we present a partially supervised outfit compatibility modeling scheme (PS-OCM). In particular, we first devise a partially supervised attribute-level embedding learning component to disentangle the fine-grained attribute embeddings from the entire visual feature of each item. We then introduce a disentangled completeness regularizer to prevent information loss during disentanglement. Thereafter, we design a hierarchical graph convolutional network, which seamlessly integrates attribute- and item-level compatibility modeling and enables explainable compatibility reasoning. Extensive experiments on a real-world dataset demonstrate that our PS-OCM significantly outperforms the state-of-the-art baselines. We have released our source codes and well-trained models to benefit other researchers (https://site2750.wixsite.com/ps-ocm).
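A minimal sketch of a partially supervised attribute loss, where only attributes with available labels contribute, is given below; the masking scheme is an assumption and this is not the PS-OCM objective.

```python
# Masked multi-label loss over attributes: unlabeled attribute slots are ignored.
import torch
import torch.nn.functional as F

def partial_attribute_loss(attr_logits: torch.Tensor,
                           attr_labels: torch.Tensor,
                           label_mask: torch.Tensor) -> torch.Tensor:
    """attr_logits/labels: (B, A); label_mask: (B, A), 1 where a label exists."""
    loss = F.binary_cross_entropy_with_logits(attr_logits, attr_labels,
                                              reduction="none")
    return (loss * label_mask).sum() / label_mask.sum().clamp(min=1)

loss = partial_attribute_loss(torch.randn(4, 10),
                              torch.randint(0, 2, (4, 10)).float(),
                              (torch.rand(4, 10) > 0.5).float())
```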

15.
Article in English | MEDLINE | ID: mdl-35576416

ABSTRACT

Recently, fashion compatibility modeling, which scores the matching degree of several complementary fashion items, has gained increasing research attention. Previous studies have primarily learned the features of fashion items and utilized their interactions as the fashion compatibility. However, the try-on appearance of an outfit helps us learn fashion compatibility in a combined manner, where items are spatially distributed and partially covered by other items. Inspired by this, we design a try-on-enhanced fashion compatibility modeling framework, named TryonCM2, which incorporates the try-on appearance with the item interaction to enhance fashion compatibility modeling. Specifically, we treat each outfit as a sequence of items and adopt a bidirectional long short-term memory (LSTM) network to capture the latent interaction of fashion items. Meanwhile, we synthesize a try-on template image to depict the try-on appearance of an outfit. We then regard the outfit as a sequence of multiple image stripes, i.e., local content, of the try-on template, and adopt a bidirectional LSTM network to capture the contextual structure in the try-on appearance. Ultimately, we combine the fashion compatibility lying in the item interaction and the try-on appearance as the final compatibility of the outfit. Both the objective and subjective experiments on the existing FOTOS dataset demonstrate the superiority of our framework over state-of-the-art methods.
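For illustration, encoding an outfit as an item sequence with a bidirectional LSTM can be sketched as follows; the dimensions, mean pooling, and scoring head are hypothetical choices, not the TryonCM2 model.

```python
# BiLSTM over a sequence of item embeddings, pooled into a compatibility score.
import torch
import torch.nn as nn

class OutfitEncoder(nn.Module):
    def __init__(self, item_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(item_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, items: torch.Tensor) -> torch.Tensor:
        # items: (B, num_items, item_dim) visual features of an outfit
        states, _ = self.lstm(items)
        return torch.sigmoid(self.score(states.mean(dim=1)))   # compatibility score

score = OutfitEncoder()(torch.randn(2, 4, 512))
```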

16.
IEEE Trans Image Process ; 31: 227-238, 2022.
Article in English | MEDLINE | ID: mdl-34847029

ABSTRACT

Recent studies have pointed out that many well-developed Visual Question Answering (VQA) models are heavily affected by the language prior problem, which refers to making predictions based on the co-occurrence pattern between textual questions and answers instead of reasoning over visual content. To tackle this problem, most existing methods focus on strengthening visual feature learning to reduce the influence of this textual shortcut on model decisions. However, few efforts have been devoted to analyzing its inherent cause and providing an explicit interpretation. The field thus lacks good guidance for moving forward in a purposeful way, resulting in perplexity about how to construct models that overcome this non-trivial problem. In this paper, we propose to interpret the language prior problem in VQA from a class-imbalance view. Concretely, we design a novel interpretation scheme whereby the losses of mis-predicted frequent and sparse answers from the same question type are distinctly exhibited during the late training phase. It explicitly reveals why a VQA model tends to produce a frequent yet obviously wrong answer to a given question whose right answer is sparse in the training set. Based upon this observation, we further propose a novel loss re-scaling approach that assigns different weights to each answer according to the training data statistics when estimating the final loss. We apply our approach to six strong baselines, and the experimental results on two VQA-CP benchmark datasets clearly demonstrate its effectiveness. In addition, we also justify the validity of the class-imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
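A minimal sketch of frequency-based loss re-scaling, in which rarer answers receive larger weights, is shown below; the inverse-frequency rule and normalization are illustrative choices and not necessarily the paper's exact formula.

```python
# Re-weight the classification loss so sparse answers are not drowned out
# by frequent ones within a question type.
import torch
import torch.nn.functional as F
from collections import Counter

def answer_weights(train_answers, num_answers: int, smoothing: float = 1.0):
    counts = Counter(train_answers)
    freq = torch.tensor([counts.get(a, 0) + smoothing for a in range(num_answers)])
    weights = 1.0 / freq                               # rarer answer -> larger weight
    return weights / weights.sum() * num_answers       # keep the average weight ~1

weights = answer_weights([0, 0, 0, 1, 2, 2], num_answers=4)
logits, targets = torch.randn(8, 4), torch.randint(0, 4, (8,))
loss = F.cross_entropy(logits, targets, weight=weights)
```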

17.
IEEE Trans Image Process ; 30: 8265-8277, 2021.
Article in English | MEDLINE | ID: mdl-34559652

ABSTRACT

This paper focuses on tackling the problem of temporal language localization in videos, which aims to identify the start and end points of the moment described by a natural language sentence in an untrimmed video. This task is non-trivial, since it requires not only a comprehensive understanding of the video and the sentence query, but also accurate capture of the semantic correspondence between them. Existing efforts are mainly centered on exploring the sequential relations among video clips and query words to reason over the video and sentence query, neglecting other intra-modal relations (e.g., semantic similarity among video clips and syntactic dependency among the query words). Towards this end, in this work, we propose a Multi-modal Interaction Graph Convolutional Network (MIGCN), which jointly explores the complex intra-modal relations and inter-modal interactions residing in the video and sentence query to facilitate their understanding and the capture of semantic correspondence between them. In addition, we devise an adaptive context-aware localization method, where context information is incorporated into the candidate moments and multi-scale fully connected layers are designed to rank and adjust the boundaries of the generated coarse candidate moments of different lengths. Extensive experiments on the Charades-STA and ActivityNet datasets demonstrate the promising performance and superior efficiency of our model.

18.
IEEE Trans Image Process ; 30: 7732-7743, 2021.
Article in English | MEDLINE | ID: mdl-34478369

ABSTRACT

Conversational image search, a revolutionary search mode, is able to interactively induce user responses to clarify their intents step by step. Several efforts have been dedicated to the conversation part, namely automatically asking the right question at the right time for user preference elicitation, while few studies focus on the image search part given a well-prepared conversational query. In this paper, we work towards conversational image search, which is much more difficult than the traditional image search task due to the following challenges: 1) understanding complex user intents from a multimodal conversational query; 2) utilizing multiform knowledge associated with images from a memory network; and 3) enhancing the image representation with distilled knowledge. To address these problems, we present a novel contextuaL imAge seaRch sCHeme (LARCH for short), consisting of three components. In the first component, we design a multimodal hierarchical graph-based neural network, which learns the conversational query embedding for better user intent understanding. As to the second one, we devise a multi-form knowledge embedding memory network to unify heterogeneous knowledge structures into a homogeneous base that greatly facilitates relevant knowledge retrieval. In the third component, we learn the knowledge-enhanced image representation via a novel gated neural network, which selects the useful knowledge from the retrieved relevant entries. Extensive experiments have shown that our LARCH yields significant performance gains on an extended benchmark dataset. As a side contribution, we have released the data, codes, and parameter settings to facilitate other researchers in the conversational image search community.
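A simple gated fusion of an image representation with retrieved knowledge embeddings can be sketched as below; this is schematic only, and the gating network in LARCH is more elaborate than this single linear gate.

```python
# Gate decides, per dimension, how much retrieved knowledge to mix into the image feature.
import torch
import torch.nn as nn

class GatedKnowledgeFusion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, image_feat: torch.Tensor, knowledge_feat: torch.Tensor):
        g = torch.sigmoid(self.gate(torch.cat([image_feat, knowledge_feat], dim=-1)))
        return g * knowledge_feat + (1 - g) * image_feat   # keep only useful knowledge

fused = GatedKnowledgeFusion()(torch.randn(4, 512), torch.randn(4, 512))
```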

19.
IEEE Trans Image Process ; 30: 5933-5943, 2021.
Article in English | MEDLINE | ID: mdl-34166192

ABSTRACT

Video moment localization, as an important branch of video content analysis, has attracted extensive attention in recent years. However, it is still in its infancy due to the following challenges: cross-modal semantic alignment and localization efficiency. To address these impediments, we present a cross-modal semantic alignment network. To be specific, we first design a video encoder to generate moment candidates, learn their representations, as well as model their semantic relevance. Meanwhile, we design a query encoder for diverse query intention understanding. Thereafter, we introduce a multi-granularity interaction module to deeply explore the semantic correlation between multi-modalities. Thereby, we can effectively complete target moment localization via sufficient cross-modal semantic understanding. Moreover, we introduce a semantic pruning strategy to reduce cross-modal retrieval overhead, improving localization efficiency. Experimental results on two benchmark datasets have justified the superiority of our model over several state-of-the-art competitors.

20.
IEEE Trans Image Process ; 30: 4667-4677, 2021.
Article in English | MEDLINE | ID: mdl-33900915

ABSTRACT

Due to the continuous booming of surveillance and Web videos, video moment localization, as an important branch of video content analysis, has attracted wide attention from both industry and academia in recent years. It is, however, a non-trivial task due to the following challenges: temporal context modeling, intelligent moment candidate generation, as well as the necessary efficiency and scalability in practice. To address these impediments, we present a deep end-to-end cross-modal hashing network. To be specific, we first design a video encoder relying on a bidirectional temporal convolutional network to simultaneously generate moment candidates and learn their representations. Since the video encoder characterizes temporal contextual structures at multiple scales of time windows, we can thus obtain enhanced moment representations. As a counterpart, we design an independent query encoder towards user intention understanding. Thereafter, a cross-modal hashing module is developed to project these two heterogeneous representations into a shared isomorphic Hamming space for compact hash code learning. After that, we can effectively estimate the relevance score of each "moment-query" pair via the Hamming distance. Besides effectiveness, our model is far more efficient and scalable, since the hash codes of videos can be learned offline. Experimental results on real-world datasets have justified the superiority of our model over several state-of-the-art competitors.
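In a nutshell, hashing-based relevance estimation works as in the following sketch: codes are binarized offline and candidate moments are ranked by Hamming distance to the query code. This is generic code, not the paper's network, and the binarization by sign is an assumed simplification.

```python
# Binarize embeddings to {-1, +1} codes and rank moments by Hamming distance.
import torch

def to_hash_codes(embeddings: torch.Tensor) -> torch.Tensor:
    return torch.sign(embeddings)                       # (N, bits) in {-1, +1}

def hamming_distance(query_code: torch.Tensor, moment_codes: torch.Tensor):
    bits = query_code.numel()
    # inner product of +/-1 codes relates linearly to Hamming distance
    return 0.5 * (bits - moment_codes @ query_code)     # smaller = more relevant

moment_codes = to_hash_codes(torch.randn(1000, 64))     # precomputed offline
query_code = to_hash_codes(torch.randn(64))
ranking = hamming_distance(query_code, moment_codes).argsort()
```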
