Query-Adaptive Late Fusion for Hierarchical Fine-Grained Video-Text Retrieval.
Article in En | MEDLINE | ID: mdl-36279326
Recently, hierarchical fine-grained fusion mechanisms have proven effective for cross-modal retrieval between videos and texts. In such schemes, video-text semantic matching is decomposed into three levels: global-event representation matching, action-relation representation matching, and local-entity representation matching, and each level of semantic representation can, on its own, serve a given query reasonably well. However, in real-world scenarios and applications, existing methods fail to adaptively estimate, in advance of multilevel fusion, how effective each level of semantic representation is for a given query, which degrades performance. It is therefore essential to assess the effectiveness of the hierarchical semantic representations in a query-adaptive manner. To this end, this article proposes a query-adaptive multilevel fusion (QAMF) model that manipulates the multiple similarity scores between the hierarchical visual and textual representations. First, we decompose the video-side and text-side representations into hierarchical semantic representations at the global-event, action-relation, and local-entity levels. Then, the multilevel representations of each video-text pair are aligned to compute a similarity score per level. We observe that the sorted similarity score curve of an effective semantic representation differs from that of an inferior one: it exhibits a "cliff" shape followed by a gradual decline (see Fig. 1 for an example). Finally, we fit a Gaussian decay function to the tail of the score curve and compute the area under the normalized sorted similarity curve as an indicator of representation effectiveness: the smaller the area, the more effective the semantic representation, and vice versa.
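The area-based effectiveness indicator described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the min-max normalization, and the inverse-area fusion weights are assumptions, and the Gaussian tail fit is omitted for brevity (the area is computed directly on the normalized sorted curve).

```python
import numpy as np

def level_effectiveness(similarities):
    """Sort similarity scores in descending order, min-max normalize to
    [0, 1], and return the area under the normalized sorted curve on a
    unit x-axis. A small area indicates a "cliff"-shaped curve, i.e. an
    effective semantic level for this query."""
    s = np.sort(np.asarray(similarities, dtype=float))[::-1]
    s = (s - s.min()) / (s.max() - s.min() + 1e-12)
    # Trapezoidal area with the x-axis rescaled to [0, 1].
    return 0.5 * (s[:-1] + s[1:]).sum() / (len(s) - 1)

def query_adaptive_fusion(level_scores):
    """Fuse per-level similarity score vectors, weighting each level
    inversely to its curve area (hypothetical weighting scheme:
    smaller area -> larger weight), so that more effective levels
    contribute more to the final ranking."""
    areas = np.array([level_effectiveness(s) for s in level_scores])
    weights = 1.0 / (areas + 1e-12)
    weights /= weights.sum()
    fused = sum(w * np.asarray(s, dtype=float)
                for w, s in zip(weights, level_scores))
    return fused, weights
```

A cliff-shaped score vector such as `[1.0, 0.1, 0.05, 0.02, 0.01]` yields a much smaller area than a gently declining one such as `[1.0, 0.9, 0.8, 0.7, 0.6]`, so its level receives a larger fusion weight.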
Extensive experiments on three public benchmark video-text datasets demonstrate that our method consistently outperforms the state-of-the-art (SoTA). A simple demo of QAMF will soon be publicly available on our homepage: https://github.com/Lab-ANT.

Full text: 1 Database: MEDLINE Language: En Year of publication: 2022 Document type: Article