Audio-Visual Fusion Based on Interactive Attention for Person Verification.
Jing, Xuebin; He, Liang; Song, Zhida; Wang, Shaolei.
Affiliations
  • Jing X; School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China.
  • He L; Xinjiang Key Laboratory of Signal Detection and Processing, Urumqi 830017, China.
  • Song Z; School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China.
  • Wang S; Xinjiang Key Laboratory of Signal Detection and Processing, Urumqi 830017, China.
Sensors (Basel) ; 23(24)2023 Dec 15.
Article in English | MEDLINE | ID: mdl-38139689
ABSTRACT
With the rapid development of multimedia technology, person verification systems have become increasingly important in security and identity verification. However, unimodal verification systems face performance bottlenecks in complex scenarios, motivating multimodal feature fusion methods. The central problem in audio-visual multimodal feature fusion is how to effectively integrate information from different modalities to improve the accuracy and robustness of identity verification. In this paper, we focus on improving multimodal person verification systems by combining audio and visual features. We use pretrained models to extract embeddings from each modality and then conduct fusion experiments on these embeddings. The baseline approach concatenates the embeddings and passes the fused feature through a fully connected (FC) layer. Building upon this baseline, we propose three fusion models based on attention mechanisms: attention, gated, and inter-attention. These fusion models are trained on the VoxCeleb1 development set and tested on the evaluation sets of the VoxCeleb1, NIST SRE19, and CNC-AV datasets. On the VoxCeleb1 dataset, the best system achieved an equal error rate (EER) of 0.23% and a minimum detection cost function (minDCF) of 0.011. On the NIST SRE19 evaluation set, the EER was 2.60% and the minDCF was 0.283. On the CNC-AV evaluation set, the EER was 11.30% and the minDCF was 0.443. These experimental results demonstrate that the proposed fusion methods can significantly improve the performance of multimodal person verification systems.
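As an illustration of the kind of gated fusion the abstract describes, the sketch below combines an audio and a visual embedding with a learned per-dimension gate. All names, the embedding size, and the random initialisation are hypothetical; in the paper the gate parameters would be trained jointly with the verification objective, and the embeddings would come from pretrained audio and face models.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim = 4  # toy embedding size; real speaker/face embeddings are much larger

# Hypothetical audio and visual embeddings for one identity
audio = rng.standard_normal(dim)
visual = rng.standard_normal(dim)

# Gate parameters (randomly initialised here; learned in practice)
W = rng.standard_normal((dim, 2 * dim))
b = np.zeros(dim)

# The gate decides, per dimension, how much to trust each modality
gate = sigmoid(W @ np.concatenate([audio, visual]) + b)
fused = gate * audio + (1.0 - gate) * visual

print(fused.shape)  # same dimensionality as each input embedding
```

Because the gate lies in (0, 1), each fused dimension is a convex combination of the corresponding audio and visual values, so a noisy modality can be down-weighted element-wise rather than discarded outright.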
Subjects
Keywords

Full text: 1 Database: MEDLINE Main subject: Information Technology / Biometric Identification Limits: Humans Language: English Year of publication: 2023 Document type: Article
