Context-Aware Emotion Recognition in the Wild Using Spatio-Temporal and Temporal-Pyramid Models.

Do, Nhu-Tai; Kim, Soo-Hyung; Yang, Hyung-Jeong; Lee, Guee-Sang; Yeom, Soonja

Affiliation

Do NT; Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Korea.
Kim SH; Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Korea.
Yang HJ; Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Korea.
Lee GS; Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Korea.
Yeom S; School of Technology, Environment and Design, University of Tasmania, Hobart, TAS 7001, Australia.

Sensors (Basel); 21(7), 2021 Mar 27.

Article in English | MEDLINE | ID: mdl-33801739

ABSTRACT

Emotion recognition plays an important role in human-computer interaction. Recent studies have focused on video emotion recognition in the wild and have run into difficulties related to occlusion, illumination, complex behavior over time, and auditory cues. State-of-the-art methods use multiple modalities, such as frame-level, spatiotemporal, and audio approaches. However, such methods have difficulty exploiting long-term dependencies in temporal information, capturing contextual information, and integrating multi-modal information. In this paper, we introduce a flexible multi-modal system for video-based emotion recognition in the wild. Our system tracks and votes on significant faces corresponding to persons of interest in a video to classify seven basic emotions. The key contribution of this study is the use of face feature extraction with context-aware and statistical information for emotion recognition. We also build two model architectures that exploit long-term dependencies in temporal information: a temporal-pyramid model and a spatiotemporal model with a "Conv2D+LSTM+3DCNN+Classify" architecture. Finally, we propose the best selection ensemble to improve the accuracy of multi-modal fusion: it selects the best combination of spatiotemporal and temporal-pyramid models to maximize accuracy in classifying the seven basic emotions. In our experiments, we benchmark the system on the AFEW dataset and obtain high accuracy.
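The abstract does not say how the best selection ensemble searches for or fuses its candidate models. The sketch below shows one plausible reading, in which every subset of candidate models is scored on held-out data and the most accurate one is kept. The fusion rule (averaging class probabilities), the exhaustive search, the function names, the model names, and the toy three-class data are all assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations


def accuracy(probs, labels):
    """Fraction of samples whose arg-max class matches the label."""
    hits = sum(1 for p, y in zip(probs, labels) if p.index(max(p)) == y)
    return hits / len(labels)


def average_probs(prob_sets):
    """Element-wise mean of several models' per-sample probability vectors."""
    n_models = len(prob_sets)
    return [
        [sum(vals) / n_models for vals in zip(*sample_rows)]
        for sample_rows in zip(*prob_sets)
    ]


def best_selection_ensemble(model_probs, labels):
    """Score every non-empty model subset; return (best_names, best_accuracy).

    Assumption: fusion is a plain average of class probabilities.
    """
    names = sorted(model_probs)
    best_names, best_acc = (), -1.0
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            fused = average_probs([model_probs[n] for n in subset])
            acc = accuracy(fused, labels)
            if acc > best_acc:  # strict '>' keeps the smaller subset on ties
                best_names, best_acc = subset, acc
    return best_names, best_acc


# Hypothetical validation outputs: 4 clips, 3 classes (the paper uses 7).
labels = [0, 1, 2, 1]
model_probs = {
    "spatiotemporal": [
        [0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.1, 0.2, 0.7], [0.2, 0.7, 0.1],
    ],
    "temporal_pyramid": [
        [0.4, 0.5, 0.1], [0.2, 0.7, 0.1], [0.2, 0.2, 0.6], [0.1, 0.8, 0.1],
    ],
    "weak_model": [
        [0.1, 0.1, 0.8], [0.8, 0.1, 0.1], [0.3, 0.4, 0.3], [0.4, 0.3, 0.3],
    ],
}

chosen, score = best_selection_ensemble(model_probs, labels)
print(chosen, score)
```

In this toy data each of the first two models misclassifies a different clip, so their average corrects both errors, while adding the weak model makes the result worse. Exhaustive search grows as 2^n in the number of candidate models, so a greedy or capped search would be needed for larger pools.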

Subjects

Awareness; Emotions; Humans; Photic Stimulation; Physical Therapy Modalities

Keywords

best selection ensemble; facial emotion recognition; spatiotemporal; temporal-pyramid; video emotion recognition

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Awareness / Emotions Limit: Humans Language: En Journal: Sensors (Basel) Publication year: 2021 Document type: Article