Image Captioning with Bidirectional Semantic Attention-Based Guiding of Long Short-Term Memory.
Cao, Pengfei; Yang, Zhongyi; Sun, Liang; Liang, Yanchun; Yang, Mary Qu; Guan, Renchu.
Affiliation
  • Cao P; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China.
  • Yang Z; University of Chinese Academy of Sciences, Beijing 100049, China.
  • Sun L; National Laboratory of Pattern Recognition (NLPR), Institute of Automation Chinese Academy of Sciences, Beijing 100190, China.
  • Liang Y; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China.
  • Yang MQ; College of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
  • Guan R; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China.
Neural Process Lett; 50(1): 103-119, 2019 Aug.
Article in En | MEDLINE | ID: mdl-35035261
ABSTRACT
Automatically describing the contents of an image in natural language has drawn much attention because it not only integrates computer vision and natural language processing but also has practical applications. Using an end-to-end approach, we propose a bidirectional semantic attention-based guiding of long short-term memory (Bag-LSTM) model for image captioning. The proposed model consciously refines image features from previously generated text. By fine-tuning the parameters of convolutional neural networks, Bag-LSTM obtains more text-related image features via feedback propagation than other models. Unlike existing guidance-LSTM (gLSTM) methods, which directly add image features into each unit of an LSTM block, our fine-tuned model dynamically leverages more text-conditional image features, acquired by the semantic attention mechanism, as guidance information. Moreover, we exploit bidirectional gLSTM as the caption generator, which is capable of learning long-term relations between visual features and semantic information by making use of both historical and future contextual information. In addition, variations of the Bag-LSTM model are proposed in an effort to sufficiently describe high-level visual-language interactions. Experiments on the Flickr8k and MSCOCO benchmark datasets demonstrate the effectiveness of the model compared with baseline algorithms; for example, it is 51.2% higher than BRNN on the CIDEr metric.
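To make the abstract's guidance mechanism concrete, the following PyTorch code is a minimal sketch, not the authors' implementation: the module names, dimensions, and the concatenation-based injection of the guidance vector into the LSTM gates are assumptions for illustration; the paper's exact attention scoring and gLSTM formulation may differ.

```python
# Minimal sketch (illustrative, not the authors' code): attention weights over
# spatial CNN features are conditioned on the decoder hidden state, and the
# resulting text-conditional image feature is fed into every LSTM step as
# guidance, in the spirit of gLSTM-style captioners.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAttention(nn.Module):
    """Scores each spatial image region against the current hidden state."""
    def __init__(self, feat_dim, hid_dim, att_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)
        self.hid_proj = nn.Linear(hid_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, feats, h):
        # feats: (batch, regions, feat_dim); h: (batch, hid_dim)
        e = self.score(torch.tanh(self.feat_proj(feats)
                                  + self.hid_proj(h).unsqueeze(1)))  # (B, R, 1)
        alpha = F.softmax(e, dim=1)            # attention over regions
        return (alpha * feats).sum(dim=1)      # text-conditional image feature

class GuidedLSTMStep(nn.Module):
    """One guided decoding step: the guidance vector g enters every gate."""
    def __init__(self, emb_dim, feat_dim, hid_dim):
        super().__init__()
        self.cell = nn.LSTMCell(emb_dim + feat_dim, hid_dim)

    def forward(self, word_emb, g, state):
        # Concatenating g with the word embedding injects the attended image
        # feature into all four LSTM gates at this time step (one common way
        # to realize gLSTM-style guidance; the paper's variant may differ).
        return self.cell(torch.cat([word_emb, g], dim=-1), state)
```

For the bidirectional variant described in the abstract, the same guided step would be run over the caption in both directions and the forward and backward hidden states combined before predicting each word.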
Full text: 1 Collections: 01-international Database: MEDLINE Study type: Guideline / Prognostic_studies Language: En Journal: Neural Process Lett Publication year: 2019 Document type: Article
