Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension.

Zhang, Yujia; Li, Qianzhong; Pan, Yi; Zhao, Xiaoguang; Tan, Min

IEEE Trans Image Process; 33: 3256-3270, 2024.

Article in En | MEDLINE | ID: mdl-38696298

ABSTRACT

Video-based referring expression comprehension is a challenging task that requires locating the referred object in every frame of a given video. While many existing approaches treat this task as an object-tracking problem, their performance relies heavily on the quality of the tracking templates. Furthermore, when there is not enough annotation data to assist in template selection, the tracking may fail. Other approaches are based on object detection, but they often use only one frame adjacent to the key frame for feature learning, which limits their ability to establish relationships between different frames. In addition, improving the fusion of features from multiple frames and referring expressions to effectively locate the referents remains an open problem. To address these issues, we propose a novel approach called the Multi-Stage Image-Language Cross-Generative Fusion Network (MILCGF-Net), which is based on one-stage object detection. Our approach includes a Frame Dense Feature Aggregation module for dense feature learning of adjacent time sequences. Additionally, we propose an Image-Language Cross-Generative Fusion module as the main body of multi-stage learning to generate cross-modal features by calculating the similarity between video and expression, and then refining and fusing the generated features. To further enhance the cross-modal feature generation capability of our model, we introduce a consistency loss that constrains the image-language similarity and language-image similarity matrices during feature generation. We evaluate our proposed approach on three public datasets and demonstrate its effectiveness through comprehensive experimental results.
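
The abstract does not give the form of the similarity computation or of the consistency loss, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes scaled dot-product similarity, a row-wise softmax in each direction, and a mean-squared-difference penalty between the two similarity matrices; the names `cross_generate` and `consistency_loss` and all shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_generate(img, lang):
    """Generate cross-modal features from two similarity matrices.

    img:  (N, d) visual features, one row per spatial location (assumed layout).
    lang: (M, d) language features, one row per word (assumed layout).
    """
    d = img.shape[1]
    # Image-to-language similarity: each location attends over words.
    s_il = softmax(img @ lang.T / np.sqrt(d), axis=1)   # (N, M)
    # Language-to-image similarity: each word attends over locations.
    s_li = softmax(lang @ img.T / np.sqrt(d), axis=1)   # (M, N)
    img_gen = s_il @ lang    # (N, d) language-derived feature per location
    lang_gen = s_li @ img    # (M, d) image-derived feature per word
    return s_il, s_li, img_gen, lang_gen

def consistency_loss(s_il, s_li):
    # Assumed form: penalize disagreement between the two directions.
    return float(np.mean((s_il - s_li.T) ** 2))

rng = np.random.default_rng(0)
img = rng.normal(size=(6, 8))    # 6 locations, 8-dim features (toy sizes)
lang = rng.normal(size=(4, 8))   # 4 words
s_il, s_li, img_gen, lang_gen = cross_generate(img, lang)
loss = consistency_loss(s_il, s_li)
```

Because the two matrices are normalized along different axes, they are not simply transposes of each other, which is what gives a penalty of this kind something to constrain.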

Subjects

Algorithms; Image Processing, Computer-Assisted; Video Recording; Video Recording/methods; Image Processing, Computer-Assisted/methods; Humans

Full text: 1 Database: MEDLINE Main subject: Video Recording / Algorithms / Image Processing, Computer-Assisted Language: En Publication year: 2024 Document type: Article
