Spatial-Temporal Pyramid Graph Reasoning for Action Recognition.
IEEE Trans Image Process; 31: 5484-5497, 2022.
Article in English | MEDLINE | ID: mdl-35943998
Spatial-temporal relation reasoning is a significant yet challenging problem for video action recognition. Previous works typically apply local operations such as 2D or 3D convolutions to model space-time interactions in video sequences, or capture long-range space-time relations only at a single fixed scale. However, this is inadequate for obtaining a comprehensive action representation. Moreover, most models treat all input frames equally for the final classification, without selecting key frames and motion-sensitive regions, which introduces irrelevant video content and hurts performance. In this paper, we propose a generic Spatial-Temporal Pyramid Graph Network (STPG-Net) to adaptively capture long-range spatial-temporal relations in video sequences at multiple scales. Specifically, we design a temporal attention (TA) module and a spatial-temporal attention (STA) module to learn, at the feature level, the contribution of each frame and of each space-time region to an action, respectively. We then use the selected key information to build spatial-temporal pyramid graphs for long-range relation reasoning and more comprehensive action representation learning. STPG-Net can be flexibly integrated into 2D and 3D backbone networks in a plug-and-play manner. Extensive experiments show that it brings consistent improvements over many challenging baselines on several standard action recognition benchmarks (i.e., Something-Something V1 & V2, and FineGym), demonstrating the effectiveness of our approach.
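The abstract gives no implementation details, but the following PyTorch sketch illustrates the general frame-selection idea behind a temporal attention module: score each frame's features and reweight frames by learned importance. The class name TemporalAttention, the global-average-pool descriptor, and the reduction MLP are assumptions made for illustration, not the paper's actual TA design.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Hypothetical sketch of a temporal attention (TA) module.

    Scores each frame from a pooled feature descriptor and reweights
    frames by a softmax over the time axis; the pooling and MLP layout
    here are assumptions, not the paper's exact architecture.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        frame_desc = x.mean(dim=(3, 4))          # global average pool -> (b, t, c)
        scores = self.fc(frame_desc)             # per-frame logits   -> (b, t, 1)
        weights = torch.softmax(scores, dim=1)   # normalize over the time axis
        return x * weights.view(b, t, 1, 1, 1)   # reweight frames, keep shape

# Usage: reweight an 8-frame clip of 64-channel feature maps.
clip = torch.randn(2, 8, 64, 14, 14)
ta = TemporalAttention(channels=64)
out = ta(clip)
print(out.shape)  # torch.Size([2, 8, 64, 14, 14])
```

In STPG-Net the reweighted (key) frames would then feed the pyramid graph reasoning stage; because the module preserves the input shape, it can be dropped between backbone stages in the plug-and-play manner the abstract describes.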