ABSTRACT
The goal of multi-modal neural machine translation (MNMT) is to incorporate language-agnostic visual information into text to enhance the performance of machine translation. However, due to the inherent differences between image and text, these two modalities inevitably suffer from semantic mismatch. To tackle this issue, this paper adopts a multi-grained visual pivot-guided multi-modal fusion strategy with cross-modal contrastive disentangling to eliminate the linguistic gaps between different languages. By using the disentangled multi-grained visual information as a cross-lingual pivot, we can enhance the alignment between different languages and improve the performance of MNMT. We first introduce text-guided stacked cross-modal disentangling modules to progressively disentangle the image into two types of visual information: MT-related visual information and background information. We then integrate these two kinds of multi-grained visual elements to assist target sentence generation. Extensive experiments are conducted on four benchmark MNMT datasets, and the results demonstrate that our proposed approach achieves significant improvements over other state-of-the-art (SOTA) approaches on all test sets. The in-depth analysis highlights the benefits of text-guided cross-modal disentangling and visual pivot-based multi-modal fusion strategies in MNMT. We release the code at https://github.com/nlp-mnmt/ConVisPiv-MNMT.
Subjects
Neural Networks, Computer; Humans; Language; Semantics
ABSTRACT
In this article, we conduct an extensive quantitative error analysis of different multi-modal neural machine translation (MNMT) models which integrate visual features into different parts of both the encoder and the decoder. We investigate the scenario where models are trained on an in-domain training data set of parallel sentence pairs with images. We analyse two different types of MNMT models that use global and local image features: the former encode an image globally, i.e. there is one feature vector representing an entire image, whereas the latter encode spatial information, i.e. there are multiple feature vectors, each encoding a different portion of the image. We conduct an error analysis of translations generated by different MNMT models as well as text-only baselines, where we study how multi-modal models compare when translating both visual and non-visual terms. In general, we find that the additional multi-modal signals consistently improve translations, even more so when using simpler MNMT models that use global visual features. We also find that not only are translations of terms with a strong visual connotation improved, but almost all kinds of errors decrease when using multi-modal models.
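The distinction between global and local image features can be illustrated with a minimal sketch. It assumes a toy spatial feature map (a grid of per-region vectors) and assumes mean pooling as the way to obtain a single global vector; the abstract does not state how either kind of feature is extracted, and the names `local_features` and `global_feature` are hypothetical.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # H x W grid of C-dimensional vectors


def local_features(feature_map: FeatureMap) -> List[Vector]:
    """Flatten an H x W grid into H*W vectors, one per image region."""
    return [cell for row in feature_map for cell in row]


def global_feature(feature_map: FeatureMap) -> Vector:
    """Collapse the grid into one vector for the whole image.

    Mean pooling is an assumption made for illustration only.
    """
    regions = local_features(feature_map)
    dim = len(regions[0])
    return [sum(r[i] for r in regions) / len(regions) for i in range(dim)]


# Toy 2 x 2 feature map with 3 channels (made-up numbers).
toy_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[1.0, 4.0, 2.0], [3.0, 4.0, 2.0]],
]

local = local_features(toy_map)  # four vectors, one per region
glob = global_feature(toy_map)   # a single vector for the entire image
```

In this sketch a model using local features would attend over the four region vectors, while a model using global features would receive only the single pooled vector.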