1.
IEEE Trans Pattern Anal Mach Intell ; 46(8): 5712-5724, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38421845

ABSTRACT

Natural Language Generation (NLG) accepts input data in the form of images, videos, or text and generates corresponding natural language text as output. Existing NLG methods mainly adopt a supervised approach and rely heavily on coupled data-to-text pairs. However, for many targeted scenarios and for non-English languages, sufficient quantities of labeled data are often not available. As a result, it is necessary to collect and label data-text pairs for training, which is both costly and time-consuming. To relax the dependency of downstream tasks on labeled data, we propose an intuitive and effective zero-shot learning framework, ZeroNLG, which can deal with multiple NLG tasks, including image-to-text (image captioning), video-to-text (video captioning), and text-to-text (neural machine translation), across English, Chinese, German, and French within a unified framework. ZeroNLG does not require any labeled downstream pairs for training. During training, ZeroNLG (i) projects different domains (across modalities and languages) to corresponding coordinates in a shared common latent space; (ii) bridges different domains by aligning their corresponding coordinates in this space; and (iii) builds an unsupervised multilingual auto-encoder that learns to generate text by reconstructing the input text given its coordinate in the shared latent space. Consequently, during inference, following the data-to-text pipeline, ZeroNLG can generate target sentences in different languages given the coordinate of the input data in the common space. Within this unified framework, given visual (image or video) data as input, ZeroNLG can perform zero-shot visual captioning; given textual sentences as input, ZeroNLG can perform zero-shot machine translation. We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and "believable" outputs and significantly outperforms existing zero-shot methods.
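A minimal PyTorch sketch of the three training steps named in this abstract is given below; it is not the authors' implementation. The projector and auto-encoder sizes, the cosine alignment loss, and the paired dummy features (standing in for whatever correspondence signal ZeroNLG uses without labeled downstream pairs) are all illustrative assumptions.

```python
# Minimal sketch of the training objectives described in the abstract; NOT the
# authors' code. Sizes, losses, and dummy features are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, VOCAB = 256, 1000

class DomainProjector(nn.Module):
    """(i) Project a domain-specific feature to a coordinate in the shared space."""
    def __init__(self, in_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, DIM)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

class TextAutoEncoder(nn.Module):
    """(iii) Reconstruct the input text from its shared-space coordinate."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.decoder = nn.GRU(DIM, DIM, batch_first=True)
        self.out = nn.Linear(DIM, VOCAB)

    def forward(self, coord, tokens):
        h0 = coord.unsqueeze(0)                     # condition decoder on coordinate
        hidden, _ = self.decoder(self.embed(tokens[:, :-1]), h0)  # teacher forcing
        logits = self.out(hidden)
        return F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))

vision_proj, text_proj = DomainProjector(512), DomainProjector(DIM)
autoenc = TextAutoEncoder()

img_feat = torch.randn(4, 512)             # e.g. frozen image features (assumption)
txt_feat = torch.randn(4, DIM)             # paired here only for illustration; the
tokens = torch.randint(0, VOCAB, (4, 12))  # unpaired correspondence signal is not modeled

z_img, z_txt = vision_proj(img_feat), text_proj(txt_feat)
align_loss = 1 - F.cosine_similarity(z_img, z_txt).mean()   # (ii) bridge domains
recon_loss = autoenc(z_txt, tokens)                         # (iii) reconstruct text
(align_loss + recon_loss).backward()
# Zero-shot inference idea: feed z_img (a visual coordinate) to the text decoder.
```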

2.
IEEE Trans Image Process ; 32: 5366-5378, 2023.
Article in English | MEDLINE | ID: mdl-37639408

ABSTRACT

Concepts, a collective term for meaningful words that correspond to objects, actions, and attributes, can act as an intermediary for video captioning. While many efforts have been made to augment video captioning with concepts, most methods suffer from limited precision of concept detection and insufficient utilization of concepts, which can provide caption generation with inaccurate and inadequate prior information. Considering these issues, we propose a Concept-awARE video captioning framework (CARE) to facilitate plausible caption generation. Built on the encoder-decoder structure, CARE detects concepts precisely via multimodal-driven concept detection (MCD) and offers sufficient prior information to caption generation through global-local semantic guidance (G-LSG). Specifically, we implement MCD by leveraging video-to-text retrieval and the multimedia nature of videos. To achieve G-LSG, given the concept probabilities predicted by MCD, we weight and aggregate concepts to mine the video's latent topic, which affects decoding globally, and devise a simple yet efficient hybrid attention module that exploits concepts and video content to affect decoding locally. Finally, to develop CARE, we emphasize the knowledge transfer of a contrastive vision-language pre-trained model (i.e., CLIP) in terms of visual understanding and video-to-text retrieval. With the multi-role CLIP, CARE outperforms strong CLIP-based video captioning baselines at affordable extra parameter and inference-latency costs. Extensive experiments on the MSVD, MSR-VTT, and VATEX datasets demonstrate the versatility of our approach for different encoder-decoder networks and the superiority of CARE over state-of-the-art methods. Our code is available at https://github.com/yangbang18/CARE.
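The two mechanisms named here, concept probabilities feeding a global topic and a local attention over concepts and video content, can be sketched roughly as below. This is not the CARE architecture; the layer sizes, the top-k concept selection, and the single attention layer are assumptions for illustration.

```python
# Illustrative sketch of concept probabilities used globally (topic) and
# locally (attention) during one decoding step; not the paper's architecture.
import torch
import torch.nn as nn

D, N_CONCEPTS, T = 256, 500, 20

class ConceptGuidedDecoderStep(nn.Module):
    def __init__(self):
        super().__init__()
        self.concept_head = nn.Linear(D, N_CONCEPTS)      # stand-in for MCD
        self.concept_embed = nn.Embedding(N_CONCEPTS, D)
        self.attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        self.cell = nn.GRUCell(2 * D, D)

    def forward(self, video_feats, h):
        # video_feats: (B, T, D) frame features; h: (B, D) decoder state
        probs = torch.sigmoid(self.concept_head(video_feats.mean(1)))     # (B, N)
        # Global guidance: probability-weighted concept mixture as a latent topic.
        topic = probs @ self.concept_embed.weight / probs.sum(-1, keepdim=True)
        # Local guidance: attend over video frames plus top-scoring concept tokens.
        topk = probs.topk(10, dim=-1).indices                             # (B, 10)
        memory = torch.cat([video_feats, self.concept_embed(topk)], dim=1)
        ctx, _ = self.attn(h.unsqueeze(1), memory, memory)
        return self.cell(torch.cat([ctx.squeeze(1), topic], dim=-1), h)

step = ConceptGuidedDecoderStep()
h = step(torch.randn(2, T, D), torch.zeros(2, D))
print(h.shape)   # torch.Size([2, 256])
```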

3.
IEEE Trans Image Process ; 31: 5203-5213, 2022.
Article in English | MEDLINE | ID: mdl-35914045

ABSTRACT

Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos with only video-level labels. Currently, most state-of-the-art WSTAL methods follow a Multi-Instance Learning (MIL) pipeline: producing snippet-level predictions first and then aggregating them into a video-level prediction. However, we argue that existing methods overlook two important drawbacks: 1) inadequate use of motion information and 2) the incompatibility of the prevailing cross-entropy training loss. In this paper, our analysis shows that the motion cues behind optical-flow features are complementary and informative. Inspired by this, we propose to build a context-dependent motion prior, termed motionness. Specifically, a motion graph is introduced to model motionness based on a local motion carrier (e.g., optical flow). In addition, to highlight more informative video snippets, a motion-guided loss is proposed to modulate network training conditioned on motionness scores. Extensive ablation studies confirm that motionness effectively models actions of interest and that the motion-guided loss leads to more accurate results. Moreover, the motion-guided loss is a plug-and-play loss function and is applicable to existing WSTAL methods. Without loss of generality, based on the standard MIL pipeline, our method achieves new state-of-the-art performance on three challenging benchmarks: THUMOS'14, ActivityNet v1.2, and v1.3.
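A rough sketch of the two ideas named in this abstract follows: a graph over optical-flow snippet features that yields per-snippet motionness scores, and a loss modulated by those scores. The similarity graph, the single propagation step, and the snippet-level weighting rule are assumptions, not the paper's formulation.

```python
# Sketch only: a similarity graph over flow features gives "motionness" scores,
# which then weight a snippet-level loss. All specific choices are assumptions.
import torch
import torch.nn.functional as F

def motionness_scores(flow_feats):
    """flow_feats: (T, D) optical-flow features for T snippets."""
    normed = F.normalize(flow_feats, dim=-1)
    adj = F.softmax(normed @ normed.T, dim=-1)     # row-normalized motion graph
    context = adj @ normed                         # one step of message passing
    # Context-dependent score: agreement between a snippet and its graph context.
    return torch.sigmoid((normed * context).sum(-1))

def motion_guided_loss(snippet_logits, video_label, motionness):
    """Weight the snippet-level loss by motionness (a simplifying assumption)."""
    targets = video_label.expand_as(snippet_logits)  # video label broadcast to snippets
    per_snippet = F.binary_cross_entropy_with_logits(snippet_logits, targets,
                                                     reduction="none")
    return (motionness * per_snippet).mean()

flow = torch.randn(30, 128)                          # 30 snippets of flow features
scores = motionness_scores(flow)
loss = motion_guided_loss(torch.randn(30), torch.tensor(1.0), scores)
print(scores.shape, float(loss))
```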

4.
IEEE Trans Pattern Anal Mach Intell ; 44(12): 9255-9268, 2022 Dec.
Article in English | MEDLINE | ID: mdl-34855588

ABSTRACT

Training a supervised video captioning model requires coupled video-caption pairs. However, for many target languages, sufficient paired data are not available. To this end, we introduce the unpaired video captioning task, which aims to train models without coupled video-caption pairs in the target language. A natural way to solve the task is to employ a two-step pipeline system: first use a video-to-pivot captioning model to generate captions in a pivot language, and then use a pivot-to-target translation model to translate the pivot captions into the target language. However, in such a pipeline system, 1) visual information cannot reach the translation model, yielding visually irrelevant target captions; and 2) errors in the generated pivot captions are propagated to the translation model, resulting in disfluent target captions. To address these problems, we propose the Unpaired Video Captioning with Visual Injection system (UVC-VI). UVC-VI first introduces the Visual Injection Module (VIM), which aligns the source visual and target language domains to inject the source visual information into the target language domain. Meanwhile, VIM directly connects the encoder of the video-to-pivot model and the decoder of the pivot-to-target model, allowing end-to-end inference by completely skipping the generation of pivot captions. To enhance the cross-modality injection of the VIM, UVC-VI further introduces a pluggable video encoder, i.e., the Multimodal Collaborative Encoder (MCE). Experiments show that UVC-VI outperforms pipeline systems and exceeds several supervised systems. Furthermore, equipping existing supervised systems with our MCE achieves 4% and 7% relative CIDEr gains over current state-of-the-art models on the MSVD and MSR-VTT benchmarks, respectively.
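The pipeline-skipping idea can be illustrated with the schematic below, which is not the UVC-VI implementation: a stand-in injection mapping connects a toy video encoder to a toy target-language decoder so greedy decoding proceeds without ever generating pivot captions. All layer choices and sizes are assumptions.

```python
# Schematic of decoding in the target language directly from video features,
# skipping pivot captions. Layer sizes and the linear injection are assumptions.
import torch
import torch.nn as nn

D, VOCAB, MAX_LEN = 256, 8000, 15

video_encoder = nn.GRU(512, D, batch_first=True)    # video-to-pivot encoder (toy)
visual_injection = nn.Linear(D, D)                  # stand-in for the VIM
target_decoder = nn.GRU(D, D, batch_first=True)     # pivot-to-target decoder (toy)
embed, out = nn.Embedding(VOCAB, D), nn.Linear(D, VOCAB)

@torch.no_grad()
def caption_without_pivot(video_frames, bos_id=1):
    """Greedy decoding in the target language, with no pivot caption generated."""
    _, h = video_encoder(video_frames)               # (1, B, D) video summary
    h = visual_injection(h)                          # inject visual information
    tokens = torch.full((video_frames.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(MAX_LEN):
        dec_out, h = target_decoder(embed(tokens[:, -1:]), h)
        nxt = out(dec_out[:, -1]).argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens

print(caption_without_pivot(torch.randn(2, 30, 512)).shape)   # torch.Size([2, 16])
```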

5.
IEEE Trans Image Process ; 30: 9294-9305, 2021.
Article in English | MEDLINE | ID: mdl-34752393

ABSTRACT

Human-Object Interaction (HOI) detection aims to learn how humans interact with surrounding objects by inferring 〈human, verb, object〉 triplets. Recent HOI detection methods infer HOIs by directly extracting appearance features and spatial configurations from the related human and object targets, but they neglect the interactive semantic reasoning between these targets. Meanwhile, existing spatial encodings of visual targets are simply concatenated to appearance features, which cannot dynamically promote visual feature learning. To solve these problems, we first present a novel semantic-based Interactive Reasoning Block, in which the interactive semantics implied among visual targets are efficiently exploited. Beyond inferring HOIs from discrete instance features, we then design an HOI Inferring Structure to parse pairwise interactive semantics among visual targets at both the scene-wide and instance-wide levels. Furthermore, we propose a Spatial Guidance Model based on the locations of human body parts and the object, which serves as geometric guidance to dynamically enhance visual feature learning. Based on the above modules, we construct a framework named Interactive-Net for HOI detection, which is fully differentiable and end-to-end trainable. Extensive experiments show that our proposed framework outperforms existing HOI detection methods on both the V-COCO and HICO-DET benchmarks, improving over the baseline by about 5.9% and 17.7% relative, respectively, and validating its efficacy in detecting HOIs.
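A rough sketch of the two components named in this abstract follows; it is not the Interactive-Net code. The message-passing form of the reasoning block and the sigmoid spatial gate built from whole-person and object boxes (rather than body-part locations) are simplifying assumptions.

```python
# Sketch: exchange semantics between human and object features, modulated by a
# gate derived from spatial layout. The gating/message forms are assumptions.
import torch
import torch.nn as nn

D = 256

class InteractiveReasoningBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.msg_h2o = nn.Linear(D, D)
        self.msg_o2h = nn.Linear(D, D)
        # Spatial guidance: encode human/object box layout into a feature gate.
        self.spatial_gate = nn.Sequential(nn.Linear(8, D), nn.Sigmoid())

    def forward(self, f_human, f_object, boxes):
        # boxes: (B, 8) = normalized human box (4) + object box (4); body-part
        # locations in the paper are simplified to whole-person boxes here.
        gate = self.spatial_gate(boxes)
        f_human = gate * (f_human + torch.relu(self.msg_o2h(f_object)))
        f_object = gate * (f_object + torch.relu(self.msg_h2o(f_human)))
        return f_human, f_object

block = InteractiveReasoningBlock()
h, o = block(torch.randn(4, D), torch.randn(4, D), torch.rand(4, 8))
print(h.shape, o.shape)
```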


Subjects
Algorithms, Semantics, Humans
6.
IEEE Trans Image Process ; 27(7): 3248-3263, 2018 Jul.
Article in English | MEDLINE | ID: mdl-29641404

ABSTRACT

Visual object counting (VOC) is an emerging area in computer vision that aims to estimate the number of objects of interest in a given image or video. Recently, object-density-based estimation methods have shown promise for object counting as well as rough instance localization. However, the performance of such methods tends to degrade when dealing with new objects and scenes. To address this limitation, we propose a manifold-based method for visual object counting (M-VOC), based on the manifold assumption that similar image patches share similar object densities. First, the local geometry of a given image patch is represented linearly by its neighbors from a predefined training set of patches, and the patch's object density is then reconstructed by preserving this local geometry using locally linear embedding. To improve the characterization of local geometry, additional constraints such as sparsity and non-negativity are also imposed via regularization, nonlinear mapping, and the kernel trick. Compared with state-of-the-art VOC methods, the proposed M-VOC methods achieve competitive performance on seven benchmark datasets. Experiments verify that the proposed M-VOC methods have several favorable properties, such as robustness to variations in training-set size and image resolution, which are often encountered in real-world VOC applications.
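The manifold idea can be sketched as below: reconstruct a query patch from its nearest training patches and transfer the same weights to those patches' density maps. The plain least-squares solve with clamping is only an illustrative stand-in for the paper's constrained, regularized, and kernelized formulations.

```python
# Sketch of density transfer via locally linear reconstruction; the simple
# least-squares weights below are an assumption, not the paper's solver.
import torch

def estimate_density(query_patch, train_patches, train_densities, k=5):
    """query_patch: (D,), train_patches: (N, D), train_densities: (N, H, W)."""
    dists = (train_patches - query_patch).norm(dim=-1)
    idx = dists.topk(k, largest=False).indices                   # k nearest patches
    neigh = train_patches[idx]                                    # (k, D)
    # Least-squares weights reconstructing the query from its neighbors.
    w = torch.linalg.lstsq(neigh.T, query_patch.unsqueeze(-1)).solution.squeeze(-1)
    w = w.clamp(min=0)                                            # non-negativity
    w = w / w.sum().clamp(min=1e-8)                               # sum-to-one
    # Preserve the local geometry on the density side.
    return (w[:, None, None] * train_densities[idx]).sum(0)

density = estimate_density(torch.rand(64),
                           torch.rand(200, 64),
                           torch.rand(200, 16, 16))
print(density.shape, float(density.sum()))   # estimated count = density-map sum
```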
