Integrating audio and visual modalities for multimodal personality trait recognition via hybrid deep learning.
Zhao, Xiaoming; Liao, Yuehui; Tang, Zhiwei; Xu, Yicheng; Tao, Xin; Wang, Dandan; Wang, Guoyu; Lu, Hongsheng.
Affiliations
  • Zhao X; Taizhou Central Hospital (Taizhou University Hospital), Taizhou University, Taizhou, Zhejiang, China.
  • Liao Y; Taizhou Central Hospital (Taizhou University Hospital), Taizhou University, Taizhou, Zhejiang, China.
  • Tang Z; School of Computer Science, Hangzhou Dianzi University, Hangzhou, China.
  • Xu Y; Taizhou Central Hospital (Taizhou University Hospital), Taizhou University, Taizhou, Zhejiang, China.
  • Tao X; School of Information Technology Engineering, Taizhou Vocational and Technical College, Taizhou, Zhejiang, China.
  • Wang D; Taizhou Central Hospital (Taizhou University Hospital), Taizhou University, Taizhou, Zhejiang, China.
  • Wang G; Taizhou Central Hospital (Taizhou University Hospital), Taizhou University, Taizhou, Zhejiang, China.
  • Lu H; Taizhou Central Hospital (Taizhou University Hospital), Taizhou University, Taizhou, Zhejiang, China.
Front Neurosci ; 16: 1107284, 2022.
Article in English | MEDLINE | ID: mdl-36685221
Recently, personality trait recognition, which aims to analyze people's psychological characteristics from first-impression behavioral data, has become an interesting and active topic in psychology, affective neuroscience, and artificial intelligence. To effectively exploit spatio-temporal cues in audio-visual modalities, this paper proposes a new method for multimodal personality trait recognition that integrates audio and visual modalities within a hybrid deep learning framework comprising convolutional neural networks (CNN), a bi-directional long short-term memory network (Bi-LSTM), and a Transformer network. In particular, a pre-trained deep audio CNN model is used to learn high-level segment-level audio features. A pre-trained deep face CNN model is leveraged to separately learn high-level frame-level global scene features and local face features from each frame of the dynamic video sequences. These extracted deep audio-visual features are then fed into a Bi-LSTM and a Transformer network to individually capture long-term temporal dependencies, producing the final global audio and visual features for the downstream tasks. Finally, a linear regression method is employed for the single audio-based and visual-based personality trait recognition tasks, followed by a decision-level fusion strategy that produces the final Big-Five personality scores and interview scores. Experimental results on the public ChaLearn First Impression-V2 personality dataset show the effectiveness of our method, which outperforms the compared methods.
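The pipeline described in the abstract (pre-trained CNN features per modality, Bi-LSTM plus Transformer temporal modeling, a linear regression head, and decision-level fusion) can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation; the feature dimensions, layer counts, sigmoid scaling, and equal-weight fusion are illustrative assumptions only.

```python
# Minimal sketch of the described architecture (illustrative assumptions, not the authors' code).
import torch
import torch.nn as nn

class TemporalBranch(nn.Module):
    """Bi-LSTM + Transformer encoder over pre-extracted CNN features for one modality."""
    def __init__(self, feat_dim=512, hidden=256, heads=4):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Linear regression head: five Big-Five traits plus one interview score.
        self.regressor = nn.Linear(2 * hidden, 6)

    def forward(self, feats):               # feats: (batch, time, feat_dim)
        h, _ = self.bilstm(feats)            # capture long-term temporal dependencies
        h = self.transformer(h)              # global temporal context
        pooled = h.mean(dim=1)               # temporal average pooling
        return torch.sigmoid(self.regressor(pooled))  # scores in [0, 1]

# One branch per modality, fed with pre-extracted CNN features (dimensions assumed).
audio_branch = TemporalBranch(feat_dim=512)    # segment-level audio CNN features
visual_branch = TemporalBranch(feat_dim=1024)  # frame-level scene + face CNN features

audio_feats = torch.randn(2, 30, 512)     # dummy audio feature sequences
visual_feats = torch.randn(2, 30, 1024)   # dummy visual feature sequences

# Decision-level fusion: average the per-modality predictions (equal weights assumed).
scores = 0.5 * audio_branch(audio_feats) + 0.5 * visual_branch(visual_feats)
print(scores.shape)  # torch.Size([2, 6])
```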
Full text: 1 Collections: 01-international Database: MEDLINE Study type: Prognostic_studies Language: En Journal: Front Neurosci Publication year: 2022 Document type: Article Country of affiliation: China Country of publication: Switzerland