Pesquisa | BVS - MINISTÉRIO DA SAÚDE

1.

Adaptive Relation-Aware Network for zero-shot classification.

Zhang, Xun; Liu, Yang; Dang, Yuhao; Gao, Xinbo; Han, Jungong; Shao, Ling.

Neural Netw ; 174: 106227, 2024 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-38452663

RESUMO

Supervised learning-based image classification in computer vision relies on visual samples containing a large amount of labeled information. Considering that it is labor-intensive to collect and label images and construct datasets manually, Zero-Shot Learning (ZSL) achieves knowledge transfer from seen categories to unseen categories by mining auxiliary information, which reduces the dependence on labeled image samples and is one of the current research hotspots in computer vision. However, most ZSL methods fail to properly measure the relationships between classes, or do not consider the differences and similarities between classes at all. In this paper, we propose Adaptive Relation-Aware Network (ARAN), a novel ZSL approach that incorporates the improved triplet loss from deep metric learning into a VAE-based generative model, which helps to model inter-class and intra-class relationships for different classes in ZSL datasets and generate an arbitrary amount of high-quality visual features containing more discriminative information. Moreover, we validate the effectiveness and superior performance of our ARAN through experimental evaluations under ZSL and more practical GZSL settings on three popular datasets AWA2, CUB, and SUN.

Assuntos

Conhecimento

2.

Tolerant Self-Distillation for image classification.

Liu, Mushui; Yu, Yunlong; Ji, Zhong; Han, Jungong; Zhang, Zhongfei.

Neural Netw ; 174: 106215, 2024 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-38471261

RESUMO

Deep neural networks tend to suffer from the overfitting issue when the training data are not enough. In this paper, we introduce two metrics from the intra-class distribution of correct-predicted and incorrect-predicted samples to provide a new perspective on the overfitting issue. Based on it, we propose a knowledge distillation approach without pretraining a teacher model in advance named Tolerant Self-Distillation (TSD) for alleviating the overfitting issue. It introduces an online updating memory and selectively stores the class predictions of the samples from the past iterations, making it possible to distill knowledge across the iterations. Specifically, the class predictions stored in the memory bank serve as the soft labels for supervising the samples from the same class for the current iteration in a reverse way, i.e. the correct-predicted samples are supervised with the incorrect predictions while the incorrect-predicted samples are supervised with the correct predictions. Consequently, the premature convergence issue caused by the over-confident samples would be mitigated, which helps the model to converge to a better local optimum. Extensive experimental results on several image classification benchmarks, including small-scale, large-scale, and fine-grained datasets, demonstrate the superiority of the proposed TSD.

Assuntos

Benchmarking , Conhecimento , Redes Neurais de Computação

3.

Virtual Category Learning: A Semi-Supervised Learning Method for Dense Prediction with Extremely Limited Labels.

Chen, Changrui; Han, Jungong; Debattista, Kurt.

IEEE Trans Pattern Anal Mach Intell ; PP2024 Feb 20.

Artigo em Inglês | MEDLINE | ID: mdl-38376969

RESUMO

Due to the costliness of labelled data in real-world applications, semi-supervised learning, underpinned by pseudo labelling, is an appealing solution. However, handling confusing samples is nontrivial: discarding valuable confusing samples would compromise the model generalisation while using them for training would exacerbate the issue of confirmation bias caused by the resulting inevitable mislabelling. To solve this problem, this paper proposes to use confusing samples proactively without label correction. Specifically, a Virtual Category (VC) is assigned to each confusing sample in such a way that it can safely contribute to the model optimisation even without a concrete label. This provides an upper bound for inter-class information sharing capacity, which eventually leads to a better embedding space. Extensive experiments on two mainstream dense prediction tasks - semantic segmentation and object detection, demonstrate that the proposed VC learning significantly surpasses the state-of-the-art, especially when only very few labels are available. Our intriguing findings highlight the usage of VC learning in dense vision tasks.

4.

Imbalance Mitigation for Continual Learning via Knowledge Decoupling and Dual Enhanced Contrastive Learning.

Ji, Zhong; Jiao, Zhanyu; Wang, Qiang; Pang, Yanwei; Han, Jungong.

IEEE Trans Neural Netw Learn Syst ; PP2024 Jan 08.

Artigo em Inglês | MEDLINE | ID: mdl-38190680

RESUMO

Continual learning (CL) aims at studying how to learn new knowledge continuously from data streams without catastrophically forgetting the previous knowledge. One of the key problems is catastrophic forgetting, that is, the performance of the model on previous tasks declines significantly after learning the subsequent task. Several studies addressed it by replaying samples stored in the buffer when training new tasks. However, the data imbalance between old and new task samples results in two serious problems: information suppression and weak feature discriminability. The former refers to the information in the sufficient new task samples suppressing that in the old task samples, which is harmful to maintaining the knowledge since the biased output worsens the consistency of the same sample's output at different moments. The latter refers to the feature representation being biased to the new task, which lacks discrimination to distinguish both old and new tasks. To this end, we build an imbalance mitigation for CL (IMCL) framework that incorporates a decoupled knowledge distillation (DKD) approach and a dual enhanced contrastive learning (DECL) approach to tackle both problems. Specifically, the DKD approach alleviates the suppression of the new task on the old tasks by decoupling the model output probability during the replay stage, which better maintains the knowledge of old tasks. The DECL approach enhances both low-and high-level features and fuses the enhanced features to construct contrastive loss to effectively distinguish different tasks. Extensive experiments on three popular datasets show that our method achieves promising performance under task incremental learning (Task-IL), class incremental learning (Class-IL), and domain incremental learning (Domain-IL) settings.

5.

Capsule Networks With Residual Pose Routing.

Liu, Yi; Cheng, De; Zhang, Dingwen; Xu, Shoukun; Han, Jungong.

IEEE Trans Neural Netw Learn Syst ; PP2024 Jan 09.

Artigo em Inglês | MEDLINE | ID: mdl-38194388

RESUMO

Capsule networks (CapsNets) have been known difficult to develop a deeper architecture, which is desirable for high performance in the deep learning era, due to the complex capsule routing algorithms. In this article, we present a simple yet effective capsule routing algorithm, which is presented by a residual pose routing. Specifically, the higher-layer capsule pose is achieved by an identity mapping on the adjacently lower-layer capsule pose. Such simple residual pose routing has two advantages: 1) reducing the routing computation complexity and 2) avoiding gradient vanishing due to its residual learning framework. On top of that, we explicitly reformulate the capsule layers by building a residual pose block. Stacking multiple such blocks results in a deep residual CapsNets (ResCaps) with a ResNet-like architecture. Results on MNIST, AffNIST, SmallNORB, and CIFAR-10/100 show the effectiveness of ResCaps for image classification. Furthermore, we successfully extend our residual pose routing to large-scale real-world applications, including 3-D object reconstruction and classification, and 2-D saliency dense prediction. The source code has been released on https://github.com/liuyi1989/ResCaps.

6.

Structural and functional connectivity of the whole brain and subnetworks in individuals with mild traumatic brain injury: predictors of patient prognosis.

Huang, Sihong; Han, Jungong; Zheng, Hairong; Li, Mengjun; Huang, Chuxin; Kui, Xiaoyan; Liu, Jun.

Neural Regen Res ; 19(7): 1553-1558, 2024 Jul 01.

Artigo em Inglês | MEDLINE | ID: mdl-38051899

7.

Balancing Feature Alignment and Uniformity for Few-Shot Classification.

Yu, Yunlong; Zhang, Dingyi; Ji, Zhong; Li, Xi; Han, Jungong; Zhang, Zhongfei.

IEEE Trans Image Process ; PP2023 Nov 03.

Artigo em Inglês | MEDLINE | ID: mdl-37922165

RESUMO

In Few-Shot Learning (FSL), the objective is to correctly recognize new samples from novel classes with only a few available samples per class. Existing methods in FSL primarily focus on learning transferable knowledge from base classes by maximizing the information between feature representations and their corresponding labels. However, this approach may suffer from the "supervision collapse" issue, which arises due to a bias towards the base classes. In this paper, we propose a solution to address this issue by preserving the intrinsic structure of the data and enabling the learning of a generalized model for the novel classes. Following the InfoMax principle, our approach maximizes two types of mutual information (MI): between the samples and their feature representations, and between the feature representations and their class labels. This allows us to strike a balance between discrimination (capturing class-specific information) and generalization (capturing common characteristics across different classes) in the feature representations. To achieve this, we adopt a unified framework that perturbs the feature embedding space using two low-bias estimators. The first estimator maximizes the MI between a pair of intra-class samples, while the second estimator maximizes the MI between a sample and its augmented views. This framework effectively combines knowledge distillation between class-wise pairs and enlarges the diversity in feature representations. By conducting extensive experiments on popular FSL benchmarks, our proposed approach achieves comparable performances with state-of-the-art competitors. For example, we achieved an accuracy of 69.53% on the miniImageNet dataset and 77.06% on the CIFAR-FS dataset for the 5-way 1-shot task.

8.

Manipulating Identical Filter Redundancy for Efficient Pruning on Deep and Complicated CNN.

Hao, Tianxiang; Ding, Xiaohan; Han, Jungong; Guo, Yuchen; Ding, Guiguang.

IEEE Trans Neural Netw Learn Syst ; PP2023 Oct 12.

Artigo em Inglês | MEDLINE | ID: mdl-37824319

RESUMO

The existence of redundancy in convolutional neural networks (CNNs) enables us to remove some filters/channels with acceptable performance drops. However, the training objective of CNNs usually tends to minimize an accuracy-related loss function without any attention paid to the redundancy, making the redundancy distribute randomly on all the filters, such that removing any of them may trigger information loss and accuracy drop, necessitating a fine-tuning step for recovery. In this article, we propose to manipulate the redundancy during training to facilitate network pruning. To this end, we propose a novel centripetal SGD (C-SGD) to make some filters identical, resulting in ideal redundancy patterns, as such filters become purely redundant due to their duplicates, hence removing them does not harm the network. As shown on CIFAR and ImageNet, C-SGD delivers better performance because the redundancy is better organized, compared to the existing methods. The efficiency also characterizes C-SGD because it is as fast as regular SGD, requires no fine-tuning, and can be conducted simultaneously on all the layers even in very deep CNNs. Besides, C-SGD can improve the accuracy of CNNs by first training a model with the same architecture but wider layers and then squeezing it into the original width.

9.

Mitigating Modality Discrepancies for RGB-T Semantic Segmentation.

Zhao, Shenlu; Liu, Yichen; Jiao, Qiang; Zhang, Qiang; Han, Jungong.

IEEE Trans Neural Netw Learn Syst ; PP2023 Jan 06.

Artigo em Inglês | MEDLINE | ID: mdl-37018600

RESUMO

Semantic segmentation models gain robustness against adverse illumination conditions by taking advantage of complementary information from visible and thermal infrared (RGB-T) images. Despite its importance, most existing RGB-T semantic segmentation models directly adopt primitive fusion strategies, such as elementwise summation, to integrate multimodal features. Such strategies, unfortunately, overlook the modality discrepancies caused by inconsistent unimodal features obtained by two independent feature extractors, thus hindering the exploitation of cross-modal complementary information within the multimodal data. For that, we propose a novel network for RGB-T semantic segmentation, i.e. MDRNet + , which is an improved version of our previous work ABMDRNet. The core of MDRNet + is a brand new idea, termed the strategy of bridging-then-fusing, which mitigates modality discrepancies before cross-modal feature fusion. Concretely, an improved Modality Discrepancy Reduction (MDR + ) subnetwork is designed, which first extracts unimodal features and reduces their modality discrepancies. Afterward, discriminative multimodal features for RGB-T semantic segmentation are adaptively selected and integrated via several channel-weighted fusion (CWF) modules. Furthermore, a multiscale spatial context (MSC) module and a multiscale channel context (MCC) module are presented to effectively capture the contextual information. Finally, we elaborately assemble a challenging RGB-T semantic segmentation dataset, i.e., RTSS, for urban scene understanding to mitigate the lack of well-annotated training data. Comprehensive experiments demonstrate that our proposed model surpasses other state-of-the-art models on the MFNet, PST900, and RTSS datasets remarkably.

10.

Semi-Supervised Unpaired Medical Image Segmentation Through Task-Affinity Consistency.

Chen, Jingkun; Zhang, Jianguo; Debattista, Kurt; Han, Jungong.

IEEE Trans Med Imaging ; 42(3): 594-605, 2023 03.

Artigo em Inglês | MEDLINE | ID: mdl-36219664

RESUMO

Deep learning-based semi-supervised learning (SSL) algorithms are promising in reducing the cost of manual annotation of clinicians by using unlabelled data, when developing medical image segmentation tools. However, to date, most existing semi-supervised learning (SSL) algorithms treat the labelled images and unlabelled images separately and ignore the explicit connection between them; this disregards essential shared information and thus hinders further performance improvements. To mine the shared information between the labelled and unlabelled images, we introduce a class-specific representation extraction approach, in which a task-affinity module is specifically designed for representation extraction. We further cast the representation into two different views of feature maps; one is focusing on low-level context, while the other concentrates on structural information. The two views of feature maps are incorporated into the task-affinity module, which then extracts the class-specific representations to aid the knowledge transfer from the labelled images to the unlabelled images. In particular, a task-affinity consistency loss between the labelled images and unlabelled images based on the multi-scale class-specific representations is formulated, leading to a significant performance improvement. Experimental results on three datasets show that our method consistently outperforms existing state-of-the-art methods. Our findings highlight the potential of consistency between class-specific knowledge for semi-supervised medical image segmentation. The code and models are to be made publicly available at https://github.com/jingkunchen/TAC.

Assuntos

Algoritmos , Aprendizado de Máquina Supervisionado

11.

Hierarchical Regression and Classification for Accurate Object Detection.

Cao, Jiale; Pang, Yanwei; Han, Jungong; Li, Xuelong.

IEEE Trans Neural Netw Learn Syst ; 34(5): 2425-2439, 2023 May.

Artigo em Inglês | MEDLINE | ID: mdl-34695000

RESUMO

Accurate object detection requires correct classification and high-quality localization. Currently, most of the single shot detectors (SSDs) conduct simultaneous classification and regression using a fully convolutional network. Despite high efficiency, this structure has some inappropriate designs for accurate object detection. The first one is the mismatch of bounding box classification, where the classification results of the default bounding boxes are improperly treated as the results of the regressed bounding boxes during the inference. The second one is that only one-time regression is not good enough for high-quality object localization. To solve the problem of classification mismatch, we propose a novel reg-offset-cls (ROC) module including three hierarchical steps: the regression of the default bounding box, the prediction of new feature sampling locations, and the classification of the regressed bounding box with more accurate features. For high-quality localization, we stack two ROC modules together. The input of the second ROC module is the output of the first ROC module. In addition, we inject a feature enhanced (FE) module between two stacked ROC modules to extract more contextual information. The experiments on three different datasets (i.e., MS COCO, PASCAL VOC, and UAVDT) are performed to demonstrate the effectiveness and superiority of our method. Without any bells or whistles, our proposed method outperforms state-of-the-art one-stage methods at a real-time speed. The source code is available at https://github.com/JialeCao001/HSD.

12.

Knowledge Distillation Classifier Generation Network for Zero-Shot Learning.

Yu, Yunlong; Li, Bin; Ji, Zhong; Han, Jungong; Zhang, Zhongfei.

IEEE Trans Neural Netw Learn Syst ; 34(6): 3183-3194, 2023 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-34587096

RESUMO

In this article, we present a conceptually simple but effective framework called knowledge distillation classifier generation network (KDCGN) for zero-shot learning (ZSL), where the learning agent requires recognizing unseen classes that have no visual data for training. Different from the existing generative approaches that synthesize visual features for unseen classifiers' learning, the proposed framework directly generates classifiers for unseen classes conditioned on the corresponding class-level semantics. To ensure the generated classifiers to be discriminative to the visual features, we borrow the knowledge distillation idea to both supervise the classifier generation and distill the knowledge with, respectively, the visual classifiers and soft targets trained from a traditional classification network. Under this framework, we develop two, respectively, strategies, i.e., class augmentation and semantics guidance, to facilitate the supervision process from the perspectives of improving visual classifiers. Specifically, the class augmentation strategy incorporates some additional categories to train the visual classifiers, which regularizes the visual classifier weights to be compact, under supervision of which the generated classifiers will be more discriminative. The semantics-guidance strategy encodes the class semantics into the visual classifiers, which would facilitate the supervision process by minimizing the differences between the generated and the real-visual classifiers. To evaluate the effectiveness of the proposed framework, we have conducted extensive experiments on five datasets in image classification, i.e., AwA1, AwA2, CUB, FLO, and APY. Experimental results show that the proposed approach performs best in the traditional ZSL task and achieves a significant performance improvement on four out of the five datasets in the generalized ZSL task.

13.

A Discriminative Cross-Aligned Variational Autoencoder for Zero-Shot Learning.

Liu, Yang; Gao, Xinbo; Han, Jungong; Shao, Ling.

IEEE Trans Cybern ; 53(6): 3794-3805, 2023 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-35468070

RESUMO

Zero-shot learning (ZSL) aims to classify unseen samples based on the relationship between the learned visual features and semantic features. Traditional ZSL methods typically capture the underlying multimodal data structures by learning an embedding function between the visual space and the semantic space with the Euclidean metric. However, these models suffer from the hubness problem and domain bias problem, which leads to unsatisfactory performance, especially in the generalized ZSL (GZSL) task. To tackle such a problem, we formulate a discriminative cross-aligned variational autoencoder (DCA-VAE) for ZSL. The proposed model effectively utilizes a modified cross-modal-alignment variational autoencoder (VAE) to transform both visual features and semantic features obtained by the discriminative cosine metric into latent features. The key to our method is that we collect principal discriminative information from visual and semantic features to construct latent features which contain the discriminative multimodal information associated with unseen samples. Finally, the proposed model DCA-VAE is validated on six benchmarks including the large dataset ImageNet, and several experimental results demonstrate the superiority of DCA-VAE over most existing embedding or generative ZSL models on the standard ZSL and the more realistic GZSL tasks.

14.

Connectome-based predictive modelling can predict follow-up craving after abstinence in individuals with opioid use disorders.

Yang, Wenhan; Han, Jungong; Luo, Jing; Tang, Fei; Fan, Li; Du, Yanyao; Yang, Longtao; Zhang, Jun; Zhang, Huiting; Liu, Jun.

Gen Psychiatr ; 36(6): e101304, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-38169807

RESUMO

Background: Individual differences have been detected in individuals with opioid use disorders (OUD) in rehabilitation following protracted abstinence. Recent studies suggested that prediction models were effective for individual-level prognosis based on neuroimage data in substance use disorders (SUD). Aims: This prospective cohort study aimed to assess neuroimaging biomarkers for individual response to protracted abstinence in opioid users using connectome-based predictive modelling (CPM). Methods: One hundred and eight inpatients with OUD underwent structural and functional magnetic resonance imaging (fMRI) scans at baseline. The Heroin Craving Questionnaire (HCQ) was used to assess craving levels at baseline and at the 8-month follow-up of abstinence. CPM with leave-one-out cross-validation was used to identify baseline networks that could predict follow-up HCQ scores and changes in HCQ (HCQfollow-up-HCQbaseline). Then, the predictive ability of identified networks was tested in a separate, heterogeneous sample of methamphetamine individuals who underwent MRI scanning before abstinence for SUD. Results: CPM could predict craving changes induced by long-term abstinence, as shown by a significant correlation between predicted and actual HCQfollow-up (r=0.417, p<0.001) and changes in HCQ (negative: r=0.334, p=0.002ï¼positive: r=0.233, p=0.038). Identified craving-related prediction networks included the somato-motor network (SMN), salience network (SALN), default mode network (DMN), medial frontal network, visual network and auditory network. In addition, decreased connectivity of frontal-parietal network (FPN)-SMN, FPN-DMN and FPN-SALN and increased connectivity of subcortical network (SCN)-DMN, SCN-SALN and SCN-SMN were positively correlated with craving levels. Conclusions: These findings highlight the potential applications of CPM to predict the craving level of individuals after protracted abstinence, as well as the generalisation ability; the identified brain networks might be the focus of innovative therapies in the future.

15.

Disentangled Capsule Routing for Fast Part-Object Relational Saliency.

Liu, Yi; Zhang, Dingwen; Liu, Nian; Xu, Shoukun; Han, Jungong.

IEEE Trans Image Process ; 31: 6719-6732, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-36282823

RESUMO

Recently, the Part-Object Relational (POR) saliency underpinned by the Capsule Network (CapsNet) has been demonstrated to be an effective modeling mechanism to improve the saliency detection accuracy. However, it is widely known that the current capsule routing operations have huge computational complexity, which seriously limited the usability of the POR saliency models in real-time applications. To this end, this paper takes an early step towards a fast POR saliency inference by proposing a novel disentangled part-object relational network. Concretely, we disentangle horizontal routing and vertical routing from the original omnidirectional capsule routing, thus generating Disentangled Capsule Routing (DCR). This mechanism enjoys two advantages. On one hand, DCR that disentangles orthogonal 1D (i.e., vertical and horizontal) routing greatly reduces parameters and routing complexity, resulting in much faster inference than omnidirectional 2D routing adopted by existing CapsNets. On the other hand, thanks to the light POR cues explored by DCR, we could conveniently integrate the part-object routing process to different feature layers in CNN, rather than just applying it to the small-scaled one as in previous works. This helps to increase saliency inference accuracy. Compared to previous POR saliency detectors, DPORTNet infers visual saliency (5 â¼ 9 ) × faster, and is more accurate. DPORTNet is available under the open-source license at https://github.com/liuyi1989/DCR.

16.

Middle-Level Feature Fusion for Lightweight RGB-D Salient Object Detection.

Huang, Nianchang; Jiao, Qiang; Zhang, Qiang; Han, Jungong.

IEEE Trans Image Process ; 31: 6621-6634, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-36256711

RESUMO

Most existing RGB-D salient object detection (SOD) models adopt a two-stream structure to extract the information from the input RGB and depth images. Since they use two subnetworks for unimodal feature extraction and multiple multi-modal feature fusion modules for extracting cross-modal complementary information, these models require a huge number of parameters, thus hindering their real-life applications. To remedy this situation, we propose a novel middle-level feature fusion structure that allows to design a lightweight RGB-D SOD model. Specifically, the proposed structure first employs two shallow subnetworks to extract low- and middle-level unimodal RGB and depth features, respectively. Afterward, instead of integrating middle-level unimodal features multiple times at different layers, we just fuse them once via a specially designed fusion module. On top of that, high-level multi-modal semantic features are further extracted for final salient object detection via an additional subnetwork. This will greatly reduce the network's parameters. Moreover, to compensate for the performance loss due to parameter deduction, a relation-aware multi-modal feature fusion module is specially designed to effectively capture the cross-modal complementary information during the fusion of middle-level multi-modal features. By enabling the feature-level and decision-level information to interact, we maximize the usage of the fused cross-modal middle-level features and the extracted cross-modal high-level features for saliency prediction. Experimental results on several benchmark datasets verify the effectiveness and superiority of the proposed method over some state-of-the-art methods. Remarkably, our proposed model has only 3.9M parameters and runs at 33 FPS.

17.

Zero-Shot Learning With Attentive Region Embedding and Enhanced Semantics.

Liu, Yang; Dang, Yuhao; Gao, Xinbo; Han, Jungong; Shao, Ling.

IEEE Trans Neural Netw Learn Syst ; PP2022 Sep 07.

Artigo em Inglês | MEDLINE | ID: mdl-36070273

RESUMO

The performance of zero-shot learning (ZSL) can be improved progressively by learning better features and generating pseudosamples for unseen classes. Existing ZSL works typically learn feature extractors and generators independently, which may shift the unseen samples away from their real distribution and suffers from the domain bias problem. In this article, to tackle this challenge, we propose a variational autoencoder (VAE)-based framework, that is, joint Attentive Region Embedding with Enhanced Semantics (AREES), which is tailored to advance the zero-shot recognition. Specifically, AREES is end-to-end trainable and consists of three network branches: 1) attentive region embedding is used to learn the semantic-guided visual features by the attention mechanism (AM); 2) a decomposition structure and a semantic pivot regularization are used to extract enhanced semantics; and 3) a multimodal VAE (mVAE) with the cross-reconstruction loss and the distribution alignment loss is used to obtain a shared latent embedding space of visual features and semantics. Finally, features' extraction and features' generation are optimized together in AREES to address the domain shift problem to a large extent. The comprehensive evaluations on six benchmarks, including the ImageNet, demonstrate the superiority of the proposed model over its state-of-the-art counterparts.

18.

Region-Object Relation-Aware Dense Captioning via Transformer.

Shao, Zhuang; Han, Jungong; Marnerides, Demetris; Debattista, Kurt.

IEEE Trans Neural Netw Learn Syst ; PP2022 Mar 11.

Artigo em Inglês | MEDLINE | ID: mdl-35275824

RESUMO

Dense captioning provides detailed captions of complex visual scenes. While a number of successes have been achieved in recent years, there are still two broad limitations: 1) most existing methods adopt an encoder-decoder framework, where the contextual information is sequentially encoded using long short-term memory (LSTM). However, the forget gate mechanism of LSTM makes it vulnerable when dealing with a long sequence and 2) the vast majority of prior arts consider regions of interests (RoIs) equally important, thus failing to focus on more informative regions. The consequence is that the generated captions cannot highlight important contents of the image, which does not seem natural. To overcome these limitations, in this article, we propose a novel end-to-end transformer-based dense image captioning architecture, termed the transformer-based dense captioner (TDC). TDC learns the mapping between images and their dense captions via a transformer, prioritizing more informative regions. To this end, we present a novel unit, named region-object correlation score unit (ROCSU), to measure the importance of each region, where the relationships between detected objects and the region, alongside the confidence scores of detected objects within the region, are taken into account. Extensive experimental results and ablation studies on the standard dense-captioning datasets demonstrate the superiority of the proposed method to the state-of-the-art methods.

19.

Information Symmetry Matters: A Modal-Alternating Propagation Network for Few-Shot Learning.

Ji, Zhong; Hou, Zhishen; Liu, Xiyao; Pang, Yanwei; Han, Jungong.

IEEE Trans Image Process ; 31: 1520-1531, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-35050856

RESUMO

Semantic information provides intra-class consistency and inter-class discriminability beyond visual concepts, which has been employed in Few-Shot Learning (FSL) to achieve further gains. However, semantic information is only available for labeled samples but absent for unlabeled samples, in which the embeddings are rectified unilaterally by guiding the few labeled samples with semantics. Therefore, it is inevitable to bring a cross-modal bias between semantic-guided samples and nonsemantic-guided samples, which results in an information asymmetry problem. To address this problem, we propose a Modal-Alternating Propagation Network (MAP-Net) to supplement the absent semantic information of unlabeled samples, which builds information symmetry among all samples in both visual and semantic modalities. Specifically, the MAP-Net transfers the neighbor information by the graph propagation to generate the pseudo-semantics for unlabeled samples guided by the completed visual relationships and rectify the feature embeddings. In addition, due to the large discrepancy between visual and semantic modalities, we design a Relation Guidance (RG) strategy to guide the visual relation vectors via semantics so that the propagated information is more beneficial. Extensive experimental results on three semantic-labeled datasets, i.e., Caltech-UCSD-Birds 200-2011, SUN Attribute Database and Oxford 102 Flower, have demonstrated that our proposed method achieves promising performance and outperforms the state-of-the-art approaches, which indicates the necessity of information symmetry.

20.

SMAN: Stacked Multimodal Attention Network for Cross-Modal Image-Text Retrieval.

Ji, Zhong; Wang, Haoran; Han, Jungong; Pang, Yanwei.

IEEE Trans Cybern ; 52(2): 1086-1097, 2022 Feb.

Artigo em Inglês | MEDLINE | ID: mdl-32386178

RESUMO

This article focuses on tackling the task of the cross-modal image-text retrieval which has been an interdisciplinary topic in both computer vision and natural language processing communities. Existing global representation alignment-based methods fail to pinpoint the semantically meaningful portion of images and texts, while the local representation alignment schemes suffer from the huge computational burden for aggregating the similarity of visual fragments and textual words exhaustively. In this article, we propose a stacked multimodal attention network (SMAN) that makes use of the stacked multimodal attention mechanism to exploit the fine-grained interdependencies between image and text, thereby mapping the aggregation of attentive fragments into a common space for measuring cross-modal similarity. Specifically, we sequentially employ intramodal information and multimodal information as guidance to perform multiple-step attention reasoning so that the fine-grained correlation between image and text can be modeled. As a consequence, we are capable of discovering the semantically meaningful visual regions or words in a sentence which contributes to measuring the cross-modal similarity in a more precise manner. Moreover, we present a novel bidirectional ranking loss that enforces the distance among pairwise multimodal instances to be closer. Doing so allows us to make full use of pairwise supervised information to preserve the manifold structure of heterogeneous pairwise data. Extensive experiments on two benchmark datasets demonstrate that our SMAN consistently yields competitive performance compared to state-of-the-art methods.

Assuntos

Processamento de Linguagem Natural

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA