Cross-view motion consistent self-supervised video inter-intra contrastive for action representation understanding.
Bi, Shuai; Hu, Zhengping; Zhang, Hehao; Di, Jirui; Sun, Zhe.
Affiliation
  • Bi S; School of Information Science and Engineering, Yanshan University, Qinhuangdao, 066000, China; Hebei Key Laboratory of Information Transmission and Signal Processing, Qinhuangdao, 066000, China. Electronic address: xiaoshuai0710@163.com.
  • Hu Z; School of Information Science and Engineering, Yanshan University, Qinhuangdao, 066000, China; Hebei Key Laboratory of Information Transmission and Signal Processing, Qinhuangdao, 066000, China. Electronic address: hzp@ysu.edu.cn.
  • Zhang H; School of Information Science and Engineering, Yanshan University, Qinhuangdao, 066000, China; Hebei Key Laboratory of Information Transmission and Signal Processing, Qinhuangdao, 066000, China. Electronic address: zhanghh@stumail.ysu.edu.cn.
  • Di J; School of Information Science and Engineering, Yanshan University, Qinhuangdao, 066000, China; Hebei Key Laboratory of Information Transmission and Signal Processing, Qinhuangdao, 066000, China. Electronic address: dijirui@stumail.ysu.edu.cn.
  • Sun Z; School of Information Science and Engineering, Yanshan University, Qinhuangdao, 066000, China; Hebei Key Laboratory of Information Transmission and Signal Processing, Qinhuangdao, 066000, China. Electronic address: zhe.sun@ysu.edu.cn.
Neural Netw ; 179: 106578, 2024 Nov.
Article in En | MEDLINE | ID: mdl-39111158
ABSTRACT
Self-supervised contrastive learning draws on powerful representational models to acquire generic semantic features from unlabeled data, and the key to training such models lies in how accurately they track motion features. Previous video contrastive learning methods have relied extensively on spatial or temporal augmentations to generate similar instances, which makes the resulting models more likely to learn static backgrounds than motion features. To alleviate these background shortcuts, in this paper we propose a cross-view motion consistent (CVMC) self-supervised video inter-intra contrastive model that focuses on learning local details and long-term temporal relationships. Specifically, we first extract the dynamic features of consecutive video snippets and then align these features based on multi-view motion consistency. Meanwhile, we use the optimized dynamic features for instance-level comparison across different videos and for local spatial fine-grained comparison with temporal ordering within the same video. Ultimately, the joint optimization of spatio-temporal alignment and motion discrimination effectively addresses the missing components of instance recognition, spatial compactness, and temporal perception in self-supervised learning. Experimental results show that our proposed self-supervised model effectively learns visual representation information and achieves highly competitive performance compared to other state-of-the-art methods on both action recognition and video retrieval tasks.
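The inter-video instance comparison described in the abstract is typically realized with an InfoNCE-style contrastive loss: features from two views of the same clip are pulled together while features from other clips in the batch are pushed apart. The sketch below is an illustrative NumPy implementation of that generic objective, not the authors' exact CVMC formulation; the function name `info_nce`, the batch shapes, and the temperature value are assumptions for the example.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Illustrative InfoNCE loss between two views of a batch of clip features.

    z1, z2: (N, D) arrays; row i of z2 is the positive for row i of z1,
    and all other rows of z2 serve as negatives (inter-video contrast).
    """
    # L2-normalize so dot products become cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal; minimize their negative log-likelihood
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 16))
aligned = anchor + 0.01 * rng.normal(size=(8, 16))   # motion-consistent view
loss_aligned = info_nce(anchor, aligned)
loss_random = info_nce(anchor, rng.normal(size=(8, 16)))
```

When the second view is well aligned with the anchor (as the paper's motion-consistency step aims to ensure), the loss is far lower than for unrelated features, which is what drives the representation toward motion rather than background.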
Full text: 1 Database: MEDLINE Main subject: Video Recording Language: En Publication year: 2024 Document type: Article