Results 1 - 2 of 2
1.
IEEE Trans Pattern Anal Mach Intell; 45(8): 9454-9468, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37022836

ABSTRACT

With convolution operations, Convolutional Neural Networks (CNNs) are good at extracting local features but have difficulty capturing global representations. With cascaded self-attention modules, vision transformers can capture long-distance feature dependencies but tend to degrade local feature details. In this paper, we propose a hybrid network structure, termed Conformer, that takes advantage of both convolution operations and self-attention mechanisms for enhanced representation learning. Conformer is rooted in interactively coupling CNN local features with transformer global representations at different resolutions. Conformer adopts a dual structure so that local details and global dependencies are retained to the maximum extent. We also propose a Conformer-based detector (ConformerDet), which learns to predict and refine object proposals by performing region-level feature coupling in an augmented cross-attention fashion. Experiments on the ImageNet and MS COCO datasets validate Conformer's superiority for visual recognition and object detection, demonstrating its potential to serve as a general backbone network.


Subjects
Algorithms; Learning; Neural Networks, Computer
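
As a rough sketch of the dual-branch feature coupling described in the abstract above, the following PyTorch snippet couples a CNN feature map with transformer patch tokens in both directions at each stage. The module name (FeatureCouplingBlock), the 1x1 projections, the pooling to a 14x14 token grid, and all dimensions are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of a dual-branch block with bidirectional feature coupling,
# loosely following the Conformer abstract; all names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureCouplingBlock(nn.Module):
    """One dual-branch stage: a CNN branch keeps local detail, a transformer
    branch keeps global context, and the two exchange features each step."""
    def __init__(self, channels=64, embed_dim=128, num_heads=4, patch_hw=14):
        super().__init__()
        self.patch_hw = patch_hw
        # CNN branch: a plain residual conv block (illustrative).
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        # Transformer branch: standard encoder layer over patch tokens.
        self.attn = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        # Coupling units: project CNN feature map -> tokens and back.
        self.cnn_to_tok = nn.Conv2d(channels, embed_dim, 1)
        self.tok_to_cnn = nn.Conv2d(embed_dim, channels, 1)

    def forward(self, x, tokens):
        # x: (B, C, H, W) CNN feature map; tokens: (B, N, D) patch tokens.
        B, C, H, W = x.shape
        # CNN -> transformer: pool the feature map to the token grid and add it.
        t = self.cnn_to_tok(x)                                   # (B, D, H, W)
        t = F.adaptive_avg_pool2d(t, self.patch_hw)              # (B, D, p, p)
        tokens = tokens + t.flatten(2).transpose(1, 2)           # (B, N, D)
        tokens = self.attn(tokens)
        # Transformer -> CNN: reshape tokens to a grid, upsample, and fuse.
        g = tokens.transpose(1, 2).reshape(B, -1, self.patch_hw, self.patch_hw)
        g = F.interpolate(self.tok_to_cnn(g), size=(H, W),
                          mode="bilinear", align_corners=False)
        x = F.relu(x + self.conv(x) + g)
        return x, tokens

block = FeatureCouplingBlock()
x = torch.randn(2, 64, 56, 56)          # CNN feature map
tokens = torch.randn(2, 14 * 14, 128)   # patch tokens
x, tokens = block(x, tokens)
print(x.shape, tokens.shape)            # (2, 64, 56, 56) and (2, 196, 128)

The dual structure mentioned in the abstract corresponds here to keeping x (local detail) and tokens (global context) as separate streams that are fused, rather than merged, at every block.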
2.
Article in English | MEDLINE | ID: mdl-36417732

ABSTRACT

Weakly supervised object localization (WSOL), which trains object localization models using solely image category annotations, remains a challenging problem. Existing approaches based on convolutional neural networks (CNNs) tend to miss the full object extent while activating discriminative object parts. Our analysis shows that this is caused by the intrinsic characteristics of CNNs, which have difficulty capturing object semantics over long distances. In this article, we introduce the vision transformer to WSOL, with the aim of capturing long-range semantic dependencies of features by leveraging the transformer's cascaded self-attention mechanism. We propose the token semantic coupled attention map (TS-CAM) method, which first decomposes class-aware semantics and then couples the semantics with attention maps for semantic-aware activation. To capture object semantics at long distances and avoid partial activation, TS-CAM performs spatial embedding by partitioning an image into a set of patch tokens. To incorporate object category information into the patch tokens, TS-CAM reallocates category-related semantics to each patch token. The patch tokens are finally coupled with semantic-agnostic attention maps to perform semantic-aware object localization. By introducing semantic tokens to produce semantic-aware attention maps, we further explore the capability of TS-CAM for multicategory object localization. Experiments show that TS-CAM outperforms its CNN-CAM counterpart by 11.6% and 28.9% on the ILSVRC and CUB-200-2011 datasets, respectively, improving on the state of the art by large margins. TS-CAM also demonstrates superiority for multicategory object localization on the Pascal VOC dataset. The code is available at github.com/yuanyao366/ts-cam-extension.
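
The coupling step described in this abstract, in which class-aware semantic maps from patch tokens are combined with the semantic-agnostic attention of the class token over patches, can be sketched as below. The function name, the linear semantic head, and all shapes are assumptions for illustration rather than the released TS-CAM code.

# Hypothetical sketch of the semantic/attention coupling step from the TS-CAM
# abstract; names, shapes, and the head design are illustrative assumptions.
import torch
import torch.nn as nn

def couple_semantics_with_attention(patch_tokens, attn_cls_to_patch,
                                    num_classes, grid_hw, semantic_head):
    """patch_tokens:      (B, N, D)  patch embeddings from the last ViT block
       attn_cls_to_patch: (B, N)     attention from the class token to each
                                     patch (semantic-agnostic)
       returns:           (B, num_classes, h, w) semantic-aware activation maps"""
    B, N, D = patch_tokens.shape
    h, w = grid_hw
    # Reallocate category-related semantics to each patch token with a small head.
    sem = semantic_head(patch_tokens)                  # (B, N, num_classes)
    sem = sem.transpose(1, 2).reshape(B, num_classes, h, w)
    # Couple class-aware semantic maps with the class-agnostic attention map.
    attn = attn_cls_to_patch.reshape(B, 1, h, w)
    return sem * attn                                  # semantic-aware maps

# Minimal usage with random tensors standing in for ViT outputs.
B, N, D, C = 2, 14 * 14, 384, 200
head = nn.Linear(D, C)
tokens = torch.randn(B, N, D)
attn = torch.rand(B, N)
cams = couple_semantics_with_attention(tokens, attn, C, (14, 14), head)
print(cams.shape)   # (2, 200, 14, 14)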
