Results 1 - 11 of 11
1.
IEEE Trans Pattern Anal Mach Intell; 46(5): 2819-2837, 2024 May.
Article in English | MEDLINE | ID: mdl-38015700

ABSTRACT

Cloth-changing person reidentification (ReID) is a newly emerging research topic aimed at addressing the large feature variations caused by clothing changes and pedestrian view/pose changes. Although significant progress has been achieved by introducing extra information (e.g., human contour sketches, human body keypoints, and 3D human information), cloth-changing person ReID remains challenging because pedestrian appearance representations can change at any time. Moreover, human semantic information and pedestrian identity information are not fully explored. To solve these issues, we propose a novel identity-guided collaborative learning scheme (IGCL) for cloth-changing person ReID, in which human semantic information is effectively utilized and the unchanging identity guides collaborative learning. First, we design a novel clothing attention degradation stream to reasonably reduce the interference caused by clothing information, where clothing attention and mid-level collaborative learning are employed. Second, we propose a human semantic attention and body jigsaw stream to highlight the human semantic information and simulate different poses of the same identity. In this way, the extracted features not only focus on human semantic information that is unrelated to the background but are also robust to pedestrian pose variations. Moreover, a pedestrian identity enhancement stream is proposed to enhance the importance of identity and extract more favorable identity-robust features. Most importantly, all of these streams are jointly explored in an end-to-end unified framework, and the identity is utilized to guide the optimization. Extensive experiments on six public cloth-changing person ReID datasets (LaST, LTCC, PRCC, NKUP, Celeb-reID-light, and VC-Clothes) demonstrate the superiority of IGCL. It outperforms existing methods on multiple datasets, and the extracted features have stronger representation and discrimination ability and are only weakly correlated with clothing.
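The multi-stream, identity-guided idea can be illustrated with a minimal sketch (not the authors' code; the stream designs, dimensions, and shared ID classifier are assumptions): several feature streams are trained jointly, with the unchanging identity label supervising every stream through one shared classifier.

```python
# Illustrative sketch only: identity-guided multi-stream training.
import torch
import torch.nn as nn

class MultiStreamIDModel(nn.Module):
    def __init__(self, feat_dim=512, num_ids=1000, num_streams=3):
        super().__init__()
        # Placeholder streams standing in for the clothing-attention-degradation,
        # semantic-attention/jigsaw, and identity-enhancement branches (assumed).
        self.streams = nn.ModuleList(
            [nn.Sequential(nn.Linear(2048, feat_dim), nn.ReLU()) for _ in range(num_streams)]
        )
        self.id_classifier = nn.Linear(feat_dim, num_ids)  # shared, identity-guided

    def forward(self, backbone_feat):
        feats = [s(backbone_feat) for s in self.streams]
        logits = [self.id_classifier(f) for f in feats]
        return feats, logits

model = MultiStreamIDModel()
x = torch.randn(8, 2048)                 # e.g. pooled backbone features
labels = torch.randint(0, 1000, (8,))
feats, logits = model(x)
loss = sum(nn.functional.cross_entropy(l, labels) for l in logits)  # identity guides every stream
```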


Subjects
Interdisciplinary Practices, Pedestrians, Humans, Algorithms, Semantics
2.
IEEE Trans Pattern Anal Mach Intell; 46(5): 3665-3678, 2024 May.
Article in English | MEDLINE | ID: mdl-38145530

ABSTRACT

The composed image retrieval (CIR) task aims to retrieve the desired target image for a given multimodal query, i.e., a reference image together with its corresponding modification text. Existing efforts have two key limitations: 1) they ignore the multiple query-target matching factors; and 2) they ignore the potential unlabeled reference-target image pairs in existing benchmark datasets. Addressing these limitations is non-trivial due to the following challenges: 1) how to effectively model the multiple matching factors in a latent way without direct supervision signals; and 2) how to fully utilize the potential unlabeled reference-target image pairs to improve the generalization ability of the CIR model. To address these challenges, in this work, we first propose a CLIP-Transformer based muLtI-factor Matching Network (LIMN), which consists of three key modules: disentanglement-based latent factor token mining, dual aggregation-based matching token learning, and dual query-target matching modeling. Thereafter, we design an iterative dual self-training paradigm to further enhance the performance of LIMN by fully utilizing the potential unlabeled reference-target image pairs in a weakly supervised manner; we denote LIMN enhanced with this paradigm as LIMN+. Extensive experiments on four datasets, including FashionIQ, Shoes, CIRR, and Fashion200K, show that the proposed LIMN and LIMN+ significantly surpass the state-of-the-art baselines.
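As a rough illustration of the composed-query setup (not the released LIMN code; the fusion layer, embedding dimensions, and cosine scoring are assumptions), a reference-image embedding and a modification-text embedding can be fused into a single query and matched against candidate target embeddings:

```python
# Illustrative sketch only: compose (reference image, modification text) and score targets.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComposedQueryMatcher(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, ref_img_emb, text_emb, target_embs):
        query = self.fuse(torch.cat([ref_img_emb, text_emb], dim=-1))
        query = F.normalize(query, dim=-1)
        targets = F.normalize(target_embs, dim=-1)
        return query @ targets.t()          # similarity matrix: queries x candidates

matcher = ComposedQueryMatcher()
scores = matcher(torch.randn(4, 512), torch.randn(4, 512), torch.randn(100, 512))
print(scores.shape)                          # (4, 100) -> rank candidates per composed query
```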

3.
Article in English | MEDLINE | ID: mdl-37943645

ABSTRACT

Cloth-changing person re-identification (ReID) is a newly emerging research topic that aims to retrieve pedestrians whose clothes have changed. Since human appearance varies greatly across different clothes, it is very difficult for existing approaches to extract discriminative and robust feature representations. Current works mainly focus on body shape or contour sketches, but human semantic information and the potential consistency of pedestrian features before and after changing clothes are not fully explored or are simply ignored. To solve these issues, in this work, a novel semantic-aware attention and visual shielding network for cloth-changing person ReID (abbreviated as SAVS) is proposed, whose key idea is to shield clues related to the appearance of clothes and focus only on visual semantic information that is insensitive to view/posture changes. Specifically, a visual semantic encoder is first employed to locate the human body and clothing regions based on human semantic segmentation information. Then, a human semantic attention (HSA) module is proposed to highlight the human semantic information and reweight the visual feature map. In addition, a visual clothes shielding (VCS) module is designed to extract a more robust feature representation for the cloth-changing task by covering the clothing regions and focusing the model on the visual semantic information unrelated to the clothes. Most importantly, these two modules are jointly explored in an end-to-end unified framework. Extensive experiments demonstrate that the proposed method significantly outperforms state-of-the-art methods and extracts more robust features for cloth-changing persons. Compared with the multibiometric unified network (MBUNet, published in TIP 2023), this method achieves improvements of 17.5% (30.9%) and 8.5% (10.4%) on the LTCC and Celeb-reID datasets in terms of mean average precision (mAP) (rank-1), respectively. Compared with the Swin Transformer (Swin-T), the improvements reach 28.6% (17.3%), 22.5% (10.0%), 19.5% (10.2%), and 8.6% (10.1%) on the PRCC, LTCC, Celeb, and NKUP datasets in terms of rank-1 (mAP), respectively.
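A minimal sketch of the shielding idea, under assumed shapes and mask semantics (this is not the paper's implementation): clothing regions indicated by a semantic mask are zeroed out on the feature map, and the remaining responses are reweighted by a simple spatial attention.

```python
# Illustrative sketch only: mask out clothing regions, then spatially reweight.
import torch
import torch.nn as nn
import torch.nn.functional as F

def shield_and_attend(feat_map, clothes_mask, attn_conv):
    # feat_map: (B, C, H, W); clothes_mask: (B, 1, h, w) with 1 = clothing pixel (assumed convention)
    mask = F.interpolate(clothes_mask.float(), size=feat_map.shape[-2:], mode="nearest")
    shielded = feat_map * (1.0 - mask)            # drop clothing-region responses
    attn = torch.sigmoid(attn_conv(shielded))     # (B, 1, H, W) spatial attention
    return shielded * attn

attn_conv = nn.Conv2d(256, 1, kernel_size=1)
out = shield_and_attend(torch.randn(2, 256, 24, 12), torch.randint(0, 2, (2, 1, 96, 48)), attn_conv)
print(out.shape)                                   # (2, 256, 24, 12)
```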

4.
Article in English | MEDLINE | ID: mdl-35954877

ABSTRACT

This study used a 2 × 2 experimental design to explore the effects of message type (non-narrative vs. narrative information) and social media metrics (high vs. low numbers of plays) of low-carbon-themed social media short videos on people's willingness to protect the environment. Subjects completed questionnaires after viewing short videos with different message types and social media metrics, and a final sample of 295 cases was included in the data analysis. The study found that, while the message type (i.e., non-narrative or narrative) of the low-carbon-themed short videos had no direct effect on people's willingness to protect the environment, its indirect effects were significant. These indirect effects operated through immersion experience and social influence. Subjects who watched narrative videos reported a higher level of immersion experience, which in turn was significantly and positively correlated with environmental intention; meanwhile, those who watched non-narrative videos experienced a higher level of social influence, which was likewise significantly and positively correlated with environmental intention. In addition, subjects who viewed videos with high play counts showed a more positive effect on their willingness to protect the environment. This study explored how message design can promote subjects' perceptions of and positive attitudes towards environmental protection, with important managerial implications.


Subjects
Social Media, Carbon, Humans, Intention, Narration, Videotape Recording
5.
IEEE Trans Image Process; 31: 4733-4745, 2022.
Article in English | MEDLINE | ID: mdl-35793293

ABSTRACT

Fashion Compatibility Modeling (FCM), which aims to automatically evaluate whether a given set of fashion items makes a compatible outfit, has attracted increasing research attention. Recent studies have demonstrated the benefits of disentangling item representations for FCM. Although these efforts have achieved prominent progress, they still perform unsatisfactorily, as they mainly investigate the visual content of fashion items while overlooking the semantic attributes of items (e.g., color and pattern), which could largely boost model performance and interpretability. To address this issue, we propose to comprehensively explore both the visual content and the attributes of fashion items for FCM. This problem is non-trivial considering the following challenges: a) how to utilize the irregular attribute labels of items to partially supervise the attribute-level representation learning of fashion items; b) how to ensure the intact disentanglement of attribute-level representations; and c) how to effectively stitch together the information at multiple granularities (i.e., coarse-grained item-level and fine-grained attribute-level) to enable both performance improvement and interpretability. To address these challenges, in this work, we present a partially supervised outfit compatibility modeling scheme (PS-OCM). In particular, we first devise a partially supervised attribute-level embedding learning component to disentangle the fine-grained attribute embeddings from the entire visual feature of each item. We then introduce a disentangled completeness regularizer to prevent information loss during disentanglement. Thereafter, we design a hierarchical graph convolutional network, which seamlessly integrates attribute- and item-level compatibility modeling and enables explainable compatibility reasoning. Extensive experiments on a real-world dataset demonstrate that PS-OCM significantly outperforms the state-of-the-art baselines. We have released our source code and well-trained models to benefit other researchers (https://site2750.wixsite.com/ps-ocm).
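The partial-supervision ingredient can be sketched as follows (an illustration only, not the released PS-OCM code; the attribute heads, dimensions, and the -1 convention for missing labels are assumptions): each attribute head is trained only on items for which that attribute label exists.

```python
# Illustrative sketch only: per-attribute embeddings with partially available labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeDisentangler(nn.Module):
    def __init__(self, in_dim=2048, attr_dims=(16, 12), emb_dim=128):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(in_dim, emb_dim) for _ in attr_dims])
        self.classifiers = nn.ModuleList([nn.Linear(emb_dim, d) for d in attr_dims])

    def forward(self, item_feat):
        embs = [h(item_feat) for h in self.heads]
        logits = [c(e) for c, e in zip(self.classifiers, embs)]
        return embs, logits

def partial_attribute_loss(logits, labels):
    # labels: (B, num_attrs), with -1 where the attribute is unlabeled
    loss = 0.0
    for a, lg in enumerate(logits):
        valid = labels[:, a] >= 0
        if valid.any():
            loss = loss + F.cross_entropy(lg[valid], labels[valid, a])
    return loss

model = AttributeDisentangler()
embs, logits = model(torch.randn(8, 2048))
labels = torch.tensor([[3, -1]] * 8)     # first attribute labeled, second missing
print(partial_attribute_loss(logits, labels))
```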

6.
IEEE Trans Image Process; 31: 4746-4760, 2022.
Article in English | MEDLINE | ID: mdl-35802541

ABSTRACT

Temporal action localization is currently an active research topic in computer vision and machine learning due to its use in smart surveillance. It is a challenging problem since the categories of the actions must be classified in untrimmed videos and the start and end of each action need to be accurately located. Although many temporal action localization methods have been proposed, they require substantial amounts of computational resources for training and inference. To solve these issues, in this work, a novel temporal-aware relation and attention network (abbreviated as TRA) is proposed for the temporal action localization task. TRA has an anchor-free, end-to-end architecture that fully uses temporal-aware information. Specifically, a temporal self-attention module is first designed to determine the relationships between different temporal positions, and more weight is given to features within the actions. Then, a multiple temporal aggregation module is constructed to aggregate temporal-domain information. Finally, a graph relation module is designed to obtain aggregated graph features, which are used to refine the boundaries and classification results. Most importantly, these three modules are jointly explored in a unified framework, and temporal awareness is fully exploited throughout. Extensive experiments demonstrate that the proposed method outperforms all state-of-the-art methods on the THUMOS14 dataset, with an average mAP reaching 67.6%, and obtains a comparable result on the ActivityNet1.3 dataset, with an average mAP of 34.4%. Compared with A2Net (TIP20), PCG-TAL (TIP21), and AFSD (CVPR21), TRA achieves improvements of 11.7%, 4.4%, and 1.8%, respectively, on the THUMOS14 dataset.
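The temporal self-attention ingredient can be illustrated with a short sketch (not the TRA release; the feature layout and dimensions are assumed): each temporal position attends to every other position over a sequence of snippet features.

```python
# Illustrative sketch only: temporal self-attention over per-snippet video features.
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (B, T, C) snippet features
        out, weights = self.attn(x, x, x)        # weights: (B, T, T) temporal relations
        return self.norm(x + out), weights

feats = torch.randn(2, 100, 512)                 # 100 temporal positions
module = TemporalSelfAttention()
refined, relations = module(feats)
print(refined.shape, relations.shape)            # (2, 100, 512) (2, 100, 100)
```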

7.
IEEE Trans Cybern; 52(2): 1247-1257, 2022 Feb.
Article in English | MEDLINE | ID: mdl-32568717

ABSTRACT

Automatic image captioning performs the cross-modal conversion from image visual content to natural language text. Involving both computer vision (CV) and natural language processing (NLP), it has become one of the most sophisticated research issues in the artificial intelligence area. Based on deep neural networks, the neural image caption (NIC) model has achieved remarkable performance in image captioning, yet some essential challenges remain, such as the deviation between the descriptive sentences generated by the model and the intrinsic content expressed by the image, the low accuracy of image scene descriptions, and the monotony of generated sentences. In addition, most current datasets and methods for image captioning are in English. However, considering the syntactic and semantic differences between Chinese and English, it is necessary to develop specialized Chinese image caption generation methods to accommodate these differences. To solve the aforementioned problems, we design the NICVATP2L model with visual attention and topic modeling, in which the visual attention mechanism reduces the deviation and the topic model improves the accuracy and diversity of generated sentences. Specifically, in the encoding phase, a convolutional neural network (CNN) and a topic model are used to extract visual and topic features of the input images, respectively. In the decoding phase, an attention mechanism is applied to the image visual features to obtain visual region features. Finally, the topic features and the visual region features are combined to guide a two-layer long short-term memory (LSTM) network in generating Chinese image captions. To validate our model, we conducted experiments on the Chinese AIC-ICC image dataset. The experimental results show that our model can automatically generate more informative and descriptive captions in Chinese in a more natural way, and it outperforms the existing NIC image captioning model.
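A hedged sketch of the general decoding idea (not the authors' NICVATP2L code; all sizes and the toy attention form are assumptions): a two-layer LSTM receives, at each step, the word embedding concatenated with an attended visual context and a topic feature.

```python
# Illustrative sketch only: topic- and attention-conditioned two-layer LSTM decoder.
import torch
import torch.nn as nn

class TopicVisualDecoder(nn.Module):
    def __init__(self, vocab=5000, emb=256, vis=512, topic=64, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.attn = nn.Linear(vis, 1)                        # toy attention score per region
        self.lstm = nn.LSTM(emb + vis + topic, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, words, region_feats, topic_feat):
        # words: (B, L); region_feats: (B, R, vis); topic_feat: (B, topic)
        w = self.embed(words)                                # (B, L, emb)
        alpha = torch.softmax(self.attn(region_feats), dim=1)
        ctx = (alpha * region_feats).sum(dim=1)              # (B, vis) attended visual context
        L = words.size(1)
        ctx = ctx.unsqueeze(1).expand(-1, L, -1)
        topic = topic_feat.unsqueeze(1).expand(-1, L, -1)
        h, _ = self.lstm(torch.cat([w, ctx, topic], dim=-1))
        return self.out(h)                                   # (B, L, vocab) next-word logits

dec = TopicVisualDecoder()
logits = dec(torch.randint(0, 5000, (2, 12)), torch.randn(2, 49, 512), torch.randn(2, 64))
print(logits.shape)
```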


Assuntos
Idioma , Processamento de Linguagem Natural , China , Redes Neurais de Computação , Semântica
8.
IEEE Trans Pattern Anal Mach Intell; 44(12): 9733-9740, 2022 12.
Article in English | MEDLINE | ID: mdl-34762584

ABSTRACT

In recent years, remarkable progress in zero-shot learning (ZSL) has been achieved by generative adversarial networks (GANs). To compensate for the lack of training samples in ZSL, a surge of GAN architectures has been developed by human experts through trial-and-error testing. Despite their efficacy, however, there is still no guarantee that these hand-crafted models can consistently achieve good performance across diverse datasets or scenarios. Accordingly, in this paper, we turn to neural architecture search (NAS) and make the first attempt to bring NAS techniques into the ZSL realm. Specifically, we propose a differentiable GAN architecture search method over a specially designed search space for zero-shot learning, referred to as ZeroNAS. Considering the relevance and balance of the generator and discriminator, ZeroNAS jointly searches their architectures in a min-max two-player game via adversarial training. Extensive experiments conducted on four widely used benchmark datasets demonstrate that ZeroNAS is capable of discovering desirable architectures that perform favorably against state-of-the-art ZSL and generalized zero-shot learning (GZSL) approaches. Source code is available at https://github.com/caixiay/ZeroNAS.
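The differentiable-search ingredient can be sketched independently of ZeroNAS itself (the candidate operation set below is an assumption): a DARTS-style mixed operation outputs a softmax-weighted sum of candidate ops, so the architecture weights can be optimized by gradient descent alongside the network weights.

```python
# Illustrative sketch only: a differentiable mixed operation with architecture weights.
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),
            nn.Sequential(nn.Linear(dim, dim), nn.Tanh()),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture weights

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

op = MixedOp()
y = op(torch.randn(4, 256))
print(y.shape)   # after search, the op with the largest alpha would be kept
```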


Subjects
Algorithms, Computer-Assisted Image Processing, Humans, Computer-Assisted Image Processing/methods, Machine Learning
9.
IEEE Trans Cybern; 51(9): 4501-4514, 2021 Sep.
Article in English | MEDLINE | ID: mdl-31794409

ABSTRACT

A GWI survey (http://tinyurl.com/zk6kgc9) has highlighted the flourishing use of multiple social networks: the average number of social media accounts per Internet user is 5.54, and 2.82 of them are used actively. Indeed, users tend to express their views on more than one social media site. Hence, merging the social signals of the same user across different social networks, when available, can facilitate downstream analyses. Previous work has paid little attention to modeling the cooperation among the following factors when fusing data from multiple social networks: 1) since data from different sources characterize the same social user, source consistency merits attention; 2) due to their different functional emphases, some aspects of the same user captured by different social networks can be purely complementary, resulting in source complementarity; and 3) different sources can contribute differently to user characterization, leading to differing source confidence. Toward this end, we propose a novel unified model that co-regularizes source consistency, complementarity, and confidence to boost learning performance with multiple social networks. In addition, we derive its theoretical solution and verify the model on the real-world application of user interest inference. Extensive experiments against several state-of-the-art competitors justify the superiority of our model.
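A rough gradient-based stand-in for the co-regularization idea (the paper derives a closed-form solution; the loss weights, head shapes, and names below are assumptions): per-source predictions are pulled toward each other for consistency and fused with learnable per-source confidence weights.

```python
# Illustrative sketch only: multi-source fusion with consistency and confidence terms.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSourceFusion(nn.Module):
    def __init__(self, dims=(300, 128), num_classes=10):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d, num_classes) for d in dims])
        self.conf = nn.Parameter(torch.zeros(len(dims)))      # source confidence logits

    def forward(self, sources):
        preds = [h(x) for h, x in zip(self.heads, sources)]
        w = torch.softmax(self.conf, dim=0)
        fused = sum(wi * p for wi, p in zip(w, preds))
        # consistency: penalize disagreement between the two source predictions
        consistency = F.mse_loss(preds[0], preds[1])
        return fused, consistency

model = MultiSourceFusion()
fused, cons = model([torch.randn(8, 300), torch.randn(8, 128)])
labels = torch.randint(0, 10, (8,))
loss = F.cross_entropy(fused, labels) + 0.1 * cons            # 0.1 is an assumed trade-off weight
```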

10.
IEEE Trans Image Process; 30: 767-782, 2021.
Article in English | MEDLINE | ID: mdl-33232234

ABSTRACT

Action recognition is a popular research topic in the computer vision and machine learning domains. Although many action recognition methods have been proposed, only a few researchers have focused on cross-domain few-shot action recognition, which must often be performed in real security surveillance. Since the problems of action recognition, domain adaptation, and few-shot learning need to be solved simultaneously, cross-domain few-shot action recognition is a challenging task. To solve these issues, in this work, we develop a novel end-to-end pairwise attentive adversarial spatiotemporal network (PASTN) for cross-domain few-shot action recognition, in which spatiotemporal information acquisition, few-shot learning, and video domain adaptation are realised in a unified framework. Specifically, the ResNet-50 network is selected as the backbone of the PASTN, and a 3D convolution block is embedded in the top layer of the 2D CNN (ResNet-50) to capture spatiotemporal representations. Moreover, a novel attentive adversarial network architecture is designed to align the spatiotemporal dynamics of actions with higher domain discrepancies. In addition, a pairwise margin discrimination loss is designed for the pairwise network architecture to improve the discrimination of the learned domain-invariant spatiotemporal features. The results of extensive experiments on three public cross-domain action recognition benchmarks, including SDAI Action I, SDAI Action II, and UCF50-OlympicSport, demonstrate that the proposed PASTN significantly outperforms state-of-the-art cross-domain action recognition methods in terms of both accuracy and computational time. Even when only two labelled training samples per category are available in the office1 scenario of the SDAI Action I dataset, the accuracy of the PASTN is improved by 6.1%, 10.9%, 16.8%, and 14% compared to that of the TA3N, TemporalPooling, I3D, and P3D methods, respectively.
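The spatiotemporal head described above can be sketched as follows (not the PASTN release; channel sizes and pooling are assumptions): a 3D convolution block is applied on top of per-frame 2D-CNN feature maps to produce a clip-level feature.

```python
# Illustrative sketch only: a 3D conv block over stacked per-frame 2D-CNN features.
import torch
import torch.nn as nn

class SpatioTemporalHead(nn.Module):
    def __init__(self, in_ch=2048, out_ch=512):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )

    def forward(self, frame_feats):
        # frame_feats: (B, T, C, H, W) per-frame 2D-CNN feature maps
        x = frame_feats.permute(0, 2, 1, 3, 4)     # -> (B, C, T, H, W) for Conv3d
        return self.block(x).flatten(1)            # (B, out_ch) clip-level feature

head = SpatioTemporalHead()
clip = head(torch.randn(2, 8, 2048, 7, 7))
print(clip.shape)                                  # (2, 512)
```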

11.
IEEE Trans Image Process; 29: 1-14, 2020.
Article in English | MEDLINE | ID: mdl-31265394

ABSTRACT

The prevailing characteristics of micro-videos limit the descriptive power of each individual modality. The micro-video representations proposed by several pioneering efforts only implicitly explore the consistency between different modalities and ignore their complementarity. In this paper, we focus on how to explicitly separate the consistent features and the complementary features from the mixed information and harness their combination to improve the expressiveness of each modality. Toward this end, we present a neural multimodal cooperative learning (NMCL) model that splits the consistent component and the complementary component via a novel relation-aware attention mechanism. Specifically, the computed attention score measures the correlation between the features extracted from different modalities. A threshold is then learned for each modality to distinguish the consistent features from the complementary ones according to this score. Thereafter, we integrate the consistent parts to enhance the representations and supplement the complementary ones to reinforce the information in each modality. To handle redundant information, which may cause overfitting and is hard to distinguish, we devise an attention network to dynamically capture the features closely related to the category and output a discriminative representation for prediction. Experimental results on a real-world micro-video dataset show that NMCL outperforms the state-of-the-art methods. Further studies verify the effectiveness and cooperative effects brought by the attentive mechanism.
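A hedged sketch of the split between consistent and complementary features (not the NMCL implementation; the soft gate used here in place of a hard threshold is an assumption): a relation-aware score is computed from both modalities, and a learnable threshold routes feature dimensions into the two parts.

```python
# Illustrative sketch only: threshold-based split into consistent and complementary parts.
import torch
import torch.nn as nn

class ConsistencySplitter(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(2 * dim, dim)           # relation-aware attention scores
        self.threshold = nn.Parameter(torch.zeros(1))  # learned per-modality threshold

    def forward(self, feat_a, feat_b):
        s = torch.sigmoid(self.score(torch.cat([feat_a, feat_b], dim=-1)))
        gate = torch.sigmoid((s - torch.sigmoid(self.threshold)) * 10.0)  # soft above/below threshold
        consistent = feat_a * gate
        complementary = feat_a * (1.0 - gate)
        return consistent, complementary

splitter = ConsistencySplitter()
cons, comp = splitter(torch.randn(4, 256), torch.randn(4, 256))
print(cons.shape, comp.shape)
```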


Subjects
Data Mining/methods, Computer-Assisted Image Processing/methods, Machine Learning, Algorithms, Animals, Dogs, Semantics, Video Recording