Video Captioning Using Global-Local Representation.
Yan, Liqi; Ma, Siqi; Wang, Qifan; Chen, Yingjie; Zhang, Xiangyu; Savakis, Andreas; Liu, Dongfang.
Affiliation
  • Yan L; Fudan University, China; Westlake University, China; Rochester Institute of Technology, USA.
  • Ma S; Westlake University, China.
  • Wang Q; Meta AI, USA. This work was done before joining Meta AI.
  • Chen Y; Purdue University, USA.
  • Zhang X; Purdue University, USA.
  • Savakis A; Rochester Institute of Technology, USA.
  • Liu D; Rochester Institute of Technology, USA.
IEEE Trans Circuits Syst Video Technol; 32(10): 6642-6656, 2022 Oct.
Article in English | MEDLINE | ID: mdl-37215187
ABSTRACT
Video captioning is a challenging task because it must accurately transform visual understanding into a natural language description. To date, state-of-the-art methods inadequately model global-local visual representations for sentence generation, leaving plenty of room for improvement. In this work, we approach the video captioning task from a new perspective and propose GLR, a framework built on global-local representation granularity. GLR demonstrates three advantages over prior efforts. First, we propose a simple solution that exploits extensive visual representations from different video ranges to improve linguistic expression. Second, we devise a novel global-local encoder that encodes different video representations, including long-range, short-range, and local-keyframe, to produce a rich semantic vocabulary and obtain a descriptive granularity of video contents across frames. Finally, we introduce a progressive training strategy that effectively organizes feature learning to achieve optimal captioning behavior. Evaluated on the MSR-VTT and MSVD datasets, GLR outperforms recent state-of-the-art methods, including a well-tuned SA-LSTM baseline, by a significant margin and with shorter training schedules. Because of its simplicity and efficacy, we hope that GLR can serve as a strong baseline for many video understanding tasks besides video captioning. Code will be available.

Full text: 1 Collection: 01-international Database: MEDLINE Language: En Journal: IEEE Trans Circuits Syst Video Technol Year: 2022 Document type: Article Country of affiliation: United States
