Rectify representation bias in vision-language models for long-tailed recognition.
Li, Bo; Yao, Yongqiang; Tan, Jingru; Gong, Ruihao; Lu, Jianwei; Luo, Ye.
Affiliation
  • Li B; Tongji University, No. 4800 Caoan Road, Shanghai, 201804, China.
  • Yao Y; Sensetime Research, No. 1900 Hongmei Road, Shanghai, 201103, China.
  • Tan J; Central South University, No. 932 South Lushan Road, Changsha, 410083, Hunan, China. Electronic address: tanjingru@csu.edu.cn.
  • Gong R; Sensetime Research, No. 1900 Hongmei Road, Shanghai, 201103, China.
  • Lu J; Shanghai University of Traditional Chinese Medicine, No. 530 Lingling Road, Shanghai, 201203, China. Electronic address: jwlu33@shutcm.edu.cn.
  • Luo Y; Tongji University, No. 4800 Caoan Road, Shanghai, 201804, China. Electronic address: yeluo@tongji.edu.cn.
Neural Netw ; 172: 106134, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38245924
ABSTRACT
Natural data typically exhibits a long-tailed distribution, presenting great challenges for recognition tasks. Due to the extreme scarcity of training instances, tail classes often show inferior performance. In this paper, we investigate the problem within the popular vision-language (VL) framework and find that the performance bottleneck mainly arises from recognition confusion between tail classes and their highly correlated head classes. Building upon this observation, and unlike previous research that primarily emphasizes class frequency in addressing long-tailed issues, we take a novel perspective by incorporating a crucial additional factor: class correlation. Specifically, we model the representation learning procedure for each sample as two parts, i.e., a specific part that learns the unique properties of its own class and a common part that learns characteristics shared among classes. Through analysis, we discover that the learning process of the common representation is easily biased toward head classes. Because of this bias, the network may adopt the biased common representation as its classification criterion, rather than prioritizing the crucial information encapsulated within the specific representation, ultimately leading to recognition confusion. To solve the problem, we build on the VL framework and introduce a rectification contrastive term (ReCT) that rectifies the representation bias according to semantic hints and training status. Extensive experiments on three widely used long-tailed datasets demonstrate the effectiveness of ReCT. On iNaturalist2018, it achieves an overall accuracy of 75.4%, surpassing the baseline by 3.6 points with a ResNet-50 visual backbone.
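The abstract does not give the exact form of ReCT, but the idea it describes (an image-text contrastive objective in which head-class bias is rectified using class statistics, so tail classes are not confused with correlated head classes) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function name `rect_contrastive_loss`, the use of a log class-prior subtraction (a logit-adjustment-style rectification), and the plain-Python tensor handling are hypothetical, not the authors' implementation.

```python
import math

def rect_contrastive_loss(features, text_embeds, labels, class_counts, tau=0.07):
    """Hypothetical sketch of a frequency-rectified image-text contrastive loss.

    features     -- list of image feature vectors (assumed L2-normalized)
    text_embeds  -- list of per-class text embeddings (the "semantic hints")
    labels       -- ground-truth class index for each image
    class_counts -- number of training instances per class (training status)
    """
    total = sum(class_counts)
    loss = 0.0
    for feat, y in zip(features, labels):
        # Similarity of the image to every class prompt, scaled by temperature.
        logits = [sum(a * b for a, b in zip(feat, t)) / tau for t in text_embeds]
        # Rectification (assumed form): subtract the log class prior so that
        # frequent head classes do not dominate the shared representation.
        logits = [z - math.log(c / total) for z, c in zip(logits, class_counts)]
        # Standard softmax cross-entropy against the ground-truth class.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        loss += log_z - logits[y]
    return loss / len(labels)
```

With a tail-class image whose feature aligns with its own class prompt, the prior subtraction boosts the tail logit relative to a 9x-more-frequent head class, driving the loss toward zero; this mirrors, in spirit only, the bias rectification the abstract describes.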
Subjects
Keywords

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Semantics / Language Study type: Prognostic_studies Language: En Journal: Neural Netw Journal subject: NEUROLOGY Publication year: 2024 Document type: Article Affiliation country: China Publication country: United States