Results 1 - 20 of 63
1.
IEEE Trans Image Process ; 33: 4419-4431, 2024.
Article in English | MEDLINE | ID: mdl-39088502

ABSTRACT

Few-Shot Class-Incremental Learning (FSCIL) aims at incrementally learning new knowledge from limited training examples without forgetting previous knowledge. However, we observe that existing methods face a challenge known as supervision collapse, where the model disproportionately emphasizes class-specific features of base classes to the detriment of novel class representations, leading to restricted cognitive capabilities. To alleviate this issue, we propose a new framework, Model aTtention Expansion for Few-Shot Class-Incremental Learning (MTE-FSCIL), aimed at expanding the model's attention field to improve transferability without compromising the discriminative capability for base classes. Specifically, the framework adopts a dual-stage training strategy, comprising pre-training and meta-training stages. In the pre-training stage, we present a new regularization technique, named the Reserver (RS) loss, to expand the global perception and reduce over-reliance on class-specific features by amplifying feature map activations. During the meta-training stage, we introduce the Repeller (RP) loss, a novel pair-based loss that promotes variation in representations and improves the model's recognition of sample uniqueness by scattering intra-class samples within the embedding space. Furthermore, we propose a Transformational Adaptation (TA) strategy to enable continuous incorporation of new knowledge from downstream tasks, thus facilitating cross-task knowledge transfer. Extensive experimental results on the mini-ImageNet, CIFAR100, and CUB200 datasets demonstrate that our proposed framework consistently outperforms state-of-the-art methods.
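
As a point of reference, the sketch below illustrates one plausible form of a pair-based loss that scatters intra-class samples in the embedding space, in the spirit of the Repeller (RP) loss described above; the formulation, weighting, and function names here are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a pair-based "repeller"-style loss that scatters
# intra-class samples in the embedding space. This penalises high cosine
# similarity between embeddings that share a class label.
import torch
import torch.nn.functional as F

def repeller_loss(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """embeddings: (N, D) batch of features; labels: (N,) class ids."""
    z = F.normalize(embeddings, dim=1)          # unit-length features
    sim = z @ z.t()                             # pairwise cosine similarity
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    off_diag = ~torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    mask = same_class & off_diag                # intra-class pairs only
    if mask.sum() == 0:
        return sim.new_zeros(())
    # Encourage intra-class pairs to spread out (lower similarity).
    return sim[mask].mean()

# Example usage on random data:
z = torch.randn(8, 64, requires_grad=True)
y = torch.randint(0, 4, (8,))
loss = repeller_loss(z, y)
loss.backward()
```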

2.
Article in English | MEDLINE | ID: mdl-39093671

ABSTRACT

Recently, fast Magnetic Resonance Imaging (MRI) reconstruction has emerged as a promising way to improve the clinical diagnostic experience by significantly reducing scan times. While existing studies have used Generative Adversarial Networks to achieve impressive results in reconstructing MR images, they still suffer from challenges such as blurred zones/boundaries and abnormal spots caused by inevitable noise in the reconstruction process. To this end, we propose a novel deep framework termed Anisotropic Diffusion-Assisted Generative Adversarial Networks, which aims to maximally preserve valid high-frequency information and structural details while minimizing noise in reconstructed images by optimizing a joint loss function in a unified framework. In doing so, it enables more authentic and accurate MR image generation. To specifically handle unforeseeable noise, an Anisotropic Diffused Reconstruction Module is developed and added alongside the backbone network as a denoising assistant, which improves the final image quality by minimizing reconstruction losses between targets and iteratively denoised generative outputs, with no extra computational complexity during the testing phase. To make the most of valuable MRI data, we extend the framework to multi-modal learning, boosting reconstructed image quality by aggregating valid information from images of diverse modalities. Extensive experiments on public datasets show that the proposed framework achieves superior performance in polishing up the quality of reconstructed MR images. For example, the proposed method obtains average PSNR and mSSIM values of 35.785 dB and 0.9765 on the MRNet dataset, which are at least about 2.9 dB and 0.07 higher than those of the baselines.
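
For context, the abstract's reconstruction module is assisted by anisotropic diffusion; the sketch below shows classic Perona-Malik anisotropic diffusion, the standard edge-preserving denoising scheme this family of modules builds on. It is a generic reference implementation, not the paper's module, and the parameters are illustrative.

```python
# Reference sketch of classic Perona-Malik anisotropic diffusion: smooths
# flat regions while preserving edges by reducing conduction across strong
# gradients. Iteration count, kappa, and gamma are illustrative choices.
import numpy as np

def anisotropic_diffusion(img: np.ndarray, n_iter=10, kappa=30.0, gamma=0.1):
    u = img.astype(np.float64).copy()
    for _ in range(n_iter):
        # finite-difference gradients toward the four neighbours
        dn = np.roll(u, -1, axis=0) - u
        ds = np.roll(u, 1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        # conduction coefficients: small across strong edges, large in flat areas
        cn, cs = np.exp(-(dn / kappa) ** 2), np.exp(-(ds / kappa) ** 2)
        ce, cw = np.exp(-(de / kappa) ** 2), np.exp(-(dw / kappa) ** 2)
        u += gamma * (cn * dn + cs * ds + ce * de + cw * dw)
    return u

noisy = np.random.rand(128, 128)
denoised = anisotropic_diffusion(noisy)
```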

3.
IEEE Trans Image Process ; 33: 4303-4318, 2024.
Article in English | MEDLINE | ID: mdl-39028600

ABSTRACT

In RGB-T tracking, rich spatial relationships exist between the target and the background within the multi-modal data, and these spatial relationships remain consistent across successive frames; both properties are crucial for boosting tracking performance. However, most existing RGB-T trackers overlook such multi-modal spatial relationships and temporal consistencies within RGB-T videos, which hinders robust tracking and practical application in complex scenarios. In this paper, we propose a novel Multi-modal Spatial-Temporal Context (MMSTC) network for RGB-T tracking, which employs a Transformer architecture to construct reliable multi-modal spatial context information and to propagate temporal context information effectively. Specifically, a Multi-modal Transformer Encoder (MMTE) is designed to encode reliable multi-modal spatial contexts and to fuse multi-modal features. Furthermore, a Quality-aware Transformer Decoder (QATD) is proposed to effectively propagate tracking cues from historical frames to the current frame, which facilitates the object searching process. Moreover, the proposed MMSTC network can be easily extended to various tracking frameworks. New state-of-the-art results on five prevalent RGB-T tracking benchmarks demonstrate the superiority of our proposed trackers over existing ones.
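
A minimal sketch of the general idea of encoding concatenated RGB and thermal tokens with a standard Transformer encoder, in the spirit of the MMTE, is given below; the token layout, dimensions, and modality embeddings are assumptions made for illustration, not the paper's architecture.

```python
# Toy multi-modal fusion: flatten RGB and thermal feature maps into tokens,
# tag each token with a learned modality embedding, and let a standard
# Transformer encoder perform cross-modal attention over the joint sequence.
import torch
import torch.nn as nn

class ToyMultiModalEncoder(nn.Module):
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.modality_embed = nn.Embedding(2, dim)   # 0 = RGB, 1 = thermal

    def forward(self, rgb_tokens, thermal_tokens):
        # rgb_tokens, thermal_tokens: (B, N, dim) flattened spatial features
        rgb = rgb_tokens + self.modality_embed.weight[0]
        thr = thermal_tokens + self.modality_embed.weight[1]
        tokens = torch.cat([rgb, thr], dim=1)         # joint multi-modal sequence
        return self.encoder(tokens)                   # cross-modal attention fusion

fused = ToyMultiModalEncoder()(torch.randn(2, 49, 256), torch.randn(2, 49, 256))
```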

4.
Neural Netw ; 174: 106215, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38471261

ABSTRACT

Deep neural networks tend to suffer from overfitting when training data are insufficient. In this paper, we introduce two metrics based on the intra-class distributions of correctly predicted and incorrectly predicted samples to provide a new perspective on the overfitting issue. Based on these metrics, we propose Tolerant Self-Distillation (TSD), a knowledge distillation approach that requires no pretrained teacher model, to alleviate overfitting. It introduces an online-updated memory that selectively stores the class predictions of samples from past iterations, making it possible to distill knowledge across iterations. Specifically, the class predictions stored in the memory bank serve as soft labels for supervising samples of the same class in the current iteration in a reversed way, i.e., correctly predicted samples are supervised with the incorrect predictions while incorrectly predicted samples are supervised with the correct predictions. Consequently, the premature convergence caused by over-confident samples is mitigated, which helps the model converge to a better local optimum. Extensive experimental results on several image classification benchmarks, including small-scale, large-scale, and fine-grained datasets, demonstrate the superiority of the proposed TSD.
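
The sketch below illustrates one plausible reading of the memory-bank mechanism: per class, store a soft prediction from correctly classified samples and one from incorrectly classified samples, then supervise each current sample with the opposite kind of stored prediction. The buffer layout, momentum update, and loss form are assumptions, not the published TSD implementation.

```python
# Hedged sketch of a reverse self-distillation step with a class-wise memory
# bank of past predictions (layout and update rule are illustrative).
import torch
import torch.nn.functional as F

num_classes, momentum = 10, 0.9
# memory[0, c]: soft label from past correct predictions of class c
# memory[1, c]: soft label from past incorrect predictions of class c
memory = torch.full((2, num_classes, num_classes), 1.0 / num_classes)

def tsd_step(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    probs = logits.softmax(dim=1)
    correct = probs.argmax(dim=1) == labels
    loss = logits.new_zeros(())
    for i in range(len(labels)):
        c = labels[i].item()
        slot = 0 if correct[i] else 1
        # update the matching slot with the current prediction
        memory[slot, c] = momentum * memory[slot, c] + (1 - momentum) * probs[i].detach()
        # supervise with the opposite slot (reverse self-distillation)
        target = memory[1 - slot, c]
        loss = loss + F.kl_div(probs[i].log(), target, reduction="sum")
    return loss / len(labels)

loss = tsd_step(torch.randn(16, num_classes, requires_grad=True),
                torch.randint(0, num_classes, (16,)))
loss.backward()
```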


Subjects
Benchmarking, Knowledge, Neural Networks (Computer)
5.
Neural Netw ; 174: 106227, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38452663

ABSTRACT

Supervised learning-based image classification in computer vision relies on visual samples containing a large amount of labeled information. Considering that it is labor-intensive to collect and label images and construct datasets manually, Zero-Shot Learning (ZSL) achieves knowledge transfer from seen categories to unseen categories by mining auxiliary information, which reduces the dependence on labeled image samples and is one of the current research hotspots in computer vision. However, most ZSL methods fail to properly measure the relationships between classes, or do not consider the differences and similarities between classes at all. In this paper, we propose the Adaptive Relation-Aware Network (ARAN), a novel ZSL approach that incorporates an improved triplet loss from deep metric learning into a VAE-based generative model, which helps to model inter-class and intra-class relationships for different classes in ZSL datasets and to generate an arbitrary amount of high-quality visual features containing more discriminative information. Moreover, we validate the effectiveness and superior performance of ARAN through experimental evaluations under the ZSL and the more practical Generalized ZSL (GZSL) settings on three popular datasets: AWA2, CUB, and SUN.
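
Since the approach hinges on a triplet-style metric-learning constraint, a minimal sketch of a standard triplet margin loss over generated and real features is shown below; the margin, pairing strategy, and dimensions are illustrative and do not reproduce the paper's improved triplet loss.

```python
# Standard triplet margin loss: pull an anchor toward a same-class feature
# and push it away from a different-class feature by at least the margin.
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.5)

anchor   = torch.randn(32, 128, requires_grad=True)  # e.g. generated visual features
positive = torch.randn(32, 128)                      # real features, same class
negative = torch.randn(32, 128)                      # real features, other class
loss = triplet(anchor, positive, negative)
loss.backward()
```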

6.
IEEE Trans Pattern Anal Mach Intell ; 46(8): 5595-5611, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38376969

ABSTRACT

Due to the costliness of labelled data in real-world applications, semi-supervised learning, underpinned by pseudo labelling, is an appealing solution. However, handling confusing samples is nontrivial: discarding valuable confusing samples would compromise model generalisation, while using them for training would exacerbate the confirmation bias caused by the resulting, inevitable mislabelling. To solve this problem, this paper proposes to use confusing samples proactively without label correction. Specifically, a Virtual Category (VC) is assigned to each confusing sample in such a way that it can safely contribute to the model optimisation even without a concrete label. This provides an upper bound on the inter-class information sharing capacity, which eventually leads to a better embedding space. Extensive experiments on two mainstream dense prediction tasks, semantic segmentation and object detection, demonstrate that the proposed VC learning significantly surpasses the state-of-the-art, especially when only very few labels are available. Our findings highlight the value of VC learning in dense vision tasks.

7.
Article in English | MEDLINE | ID: mdl-38190680

ABSTRACT

Continual learning (CL) studies how to learn new knowledge continuously from data streams without catastrophically forgetting previous knowledge. One of the key problems is catastrophic forgetting, that is, the performance of the model on previous tasks declines significantly after learning a subsequent task. Several studies address it by replaying samples stored in a buffer when training new tasks. However, the data imbalance between old- and new-task samples causes two serious problems: information suppression and weak feature discriminability. The former refers to the information in the abundant new-task samples suppressing that in the old-task samples, which harms knowledge retention because the biased output degrades the consistency of a sample's output across different training moments. The latter refers to the feature representation being biased toward the new task and thus lacking the discrimination needed to distinguish old and new tasks. To this end, we build an imbalance mitigation for CL (IMCL) framework that incorporates a decoupled knowledge distillation (DKD) approach and a dual enhanced contrastive learning (DECL) approach to tackle both problems. Specifically, the DKD approach alleviates the suppression of the old tasks by the new task by decoupling the model output probability during the replay stage, which better maintains the knowledge of old tasks. The DECL approach enhances both low- and high-level features and fuses the enhanced features into a contrastive loss that effectively distinguishes different tasks. Extensive experiments on three popular datasets show that our method achieves promising performance under task incremental learning (Task-IL), class incremental learning (Class-IL), and domain incremental learning (Domain-IL) settings.
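
The sketch below illustrates one plausible form of a decoupled replay distillation: past and current predictions are compared separately within old-task and new-task class groups, so the dominant new-task probability mass does not swamp the old-task part. The specific split and temperature are assumptions, not the published IMCL formulation.

```python
# Hedged sketch of a "decoupled" replay distillation: KL divergence is
# computed independently inside each class group (renormalised per group).
import torch
import torch.nn.functional as F

def decoupled_kd(cur_logits, old_logits, old_classes, new_classes, T=2.0):
    def group_kl(idx):
        p_old = F.softmax(old_logits[:, idx] / T, dim=1)        # teacher: stored outputs
        log_p_cur = F.log_softmax(cur_logits[:, idx] / T, dim=1)
        return F.kl_div(log_p_cur, p_old, reduction="batchmean") * T * T
    # distil within each class group independently
    return group_kl(old_classes) + group_kl(new_classes)

old_cls, new_cls = list(range(0, 5)), list(range(5, 10))
loss = decoupled_kd(torch.randn(8, 10, requires_grad=True), torch.randn(8, 10),
                    old_cls, new_cls)
loss.backward()
```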

8.
Article in English | MEDLINE | ID: mdl-38194388

ABSTRACT

Capsule networks (CapsNets) are known to be difficult to scale to deeper architectures, which are desirable for high performance in the deep learning era, because of their complex capsule routing algorithms. In this article, we present a simple yet effective capsule routing algorithm, residual pose routing. Specifically, the higher-layer capsule pose is obtained by an identity mapping on the adjacent lower-layer capsule pose. This simple residual pose routing has two advantages: 1) it reduces the routing computational complexity, and 2) it avoids vanishing gradients thanks to its residual learning framework. On top of that, we explicitly reformulate the capsule layers by building a residual pose block. Stacking multiple such blocks results in a deep residual CapsNet (ResCaps) with a ResNet-like architecture. Results on MNIST, AffNIST, SmallNORB, and CIFAR-10/100 show the effectiveness of ResCaps for image classification. Furthermore, we successfully extend our residual pose routing to large-scale real-world applications, including 3-D object reconstruction and classification, and 2-D saliency dense prediction. The source code has been released at https://github.com/liuyi1989/ResCaps.
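
A minimal sketch of the residual pose idea, where the higher-layer pose is the lower-layer pose plus a learned residual, is given below; the residual transform (a small MLP) is an illustrative choice rather than the paper's block design.

```python
# Residual pose sketch: the next layer's capsule pose is an identity mapping
# of the current pose plus a learned residual, avoiding iterative routing.
import torch
import torch.nn as nn

class ResidualPoseBlock(nn.Module):
    def __init__(self, pose_dim=16):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Linear(pose_dim, pose_dim), nn.ReLU(), nn.Linear(pose_dim, pose_dim)
        )

    def forward(self, pose):               # pose: (B, num_caps, pose_dim)
        return pose + self.residual(pose)  # identity mapping + residual "routing"

poses = torch.randn(4, 16, 16)
deep_caps = nn.Sequential(*[ResidualPoseBlock() for _ in range(6)])  # stack blocks
out = deep_caps(poses)
```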

10.
Article in English | MEDLINE | ID: mdl-37922165

ABSTRACT

In Few-Shot Learning (FSL), the objective is to correctly recognize new samples from novel classes with only a few available samples per class. Existing FSL methods primarily focus on learning transferable knowledge from base classes by maximizing the information between feature representations and their corresponding labels. However, this approach may suffer from the "supervision collapse" issue, which arises from a bias towards the base classes. In this paper, we propose a solution that preserves the intrinsic structure of the data and enables the learning of a generalized model for the novel classes. Following the InfoMax principle, our approach maximizes two types of mutual information (MI): between the samples and their feature representations, and between the feature representations and their class labels. This allows us to strike a balance between discrimination (capturing class-specific information) and generalization (capturing common characteristics across different classes) in the feature representations. To achieve this, we adopt a unified framework that perturbs the feature embedding space using two low-bias estimators. The first estimator maximizes the MI between a pair of intra-class samples, while the second maximizes the MI between a sample and its augmented views. This framework effectively combines knowledge distillation between class-wise pairs and enlarges the diversity of feature representations. In extensive experiments on popular FSL benchmarks, our proposed approach achieves performance comparable to state-of-the-art competitors. For example, it achieves an accuracy of 69.53% on the miniImageNet dataset and 77.06% on the CIFAR-FS dataset for the 5-way 1-shot task.
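
As a reference point, the sketch below shows an InfoNCE-style contrastive bound on the mutual information between a sample and its augmented view, a common low-bias surrogate for this kind of MI maximisation; the abstract's own two estimators are not specified here, so this is only an illustrative stand-in.

```python
# InfoNCE sketch: matched (sample, augmented view) pairs sit on the diagonal
# of the similarity matrix and are classified against in-batch negatives.
import torch
import torch.nn.functional as F

def info_nce(view1: torch.Tensor, view2: torch.Tensor, tau: float = 0.1):
    z1, z2 = F.normalize(view1, dim=1), F.normalize(view2, dim=1)
    logits = z1 @ z2.t() / tau                     # (N, N) similarity matrix
    targets = torch.arange(len(z1), device=z1.device)
    return F.cross_entropy(logits, targets)        # maximises a lower bound on MI

loss = info_nce(torch.randn(32, 128, requires_grad=True), torch.randn(32, 128))
loss.backward()
```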

11.
Article in English | MEDLINE | ID: mdl-37824319

ABSTRACT

The existence of redundancy in convolutional neural networks (CNNs) enables us to remove some filters/channels with acceptable performance drops. However, the training objective of CNNs usually minimizes an accuracy-related loss function without paying any attention to the redundancy, so the redundancy is distributed randomly across all the filters; removing any of them may therefore trigger information loss and an accuracy drop, necessitating a fine-tuning step for recovery. In this article, we propose to manipulate the redundancy during training to facilitate network pruning. To this end, we propose a novel centripetal SGD (C-SGD) that makes some filters identical, resulting in ideal redundancy patterns: such filters become purely redundant because of their duplicates, so removing them does not harm the network. As shown on CIFAR and ImageNet, C-SGD delivers better performance than existing methods because the redundancy is better organized. C-SGD is also efficient: it is as fast as regular SGD, requires no fine-tuning, and can be conducted simultaneously on all the layers even in very deep CNNs. Besides, C-SGD can improve the accuracy of CNNs by first training a model with the same architecture but wider layers and then squeezing it into the original width.
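
The sketch below illustrates the centripetal idea in simplified form: filters assigned to the same cluster receive their cluster's averaged gradient plus a pull toward the cluster mean, so they gradually become identical and the duplicates can be pruned. Hyperparameters and the exact C-SGD update rule are simplified assumptions.

```python
# Simplified centripetal update for clustered filters (illustrative only).
import numpy as np

def centripetal_update(weights, grads, clusters, lr=0.1, eps=0.01):
    """weights, grads: (num_filters, filter_size); clusters: list of index lists."""
    new = weights.copy()
    for idx in clusters:
        g_mean = grads[idx].mean(axis=0)          # shared (averaged) gradient
        w_mean = weights[idx].mean(axis=0)        # cluster centre
        # every filter in the cluster follows the same gradient and drifts
        # toward the centre, so intra-cluster differences shrink each step
        new[idx] = weights[idx] - lr * g_mean - eps * (weights[idx] - w_mean)
    return new

W = np.random.randn(4, 9)
G = np.random.randn(4, 9)
W = centripetal_update(W, G, clusters=[[0, 1], [2, 3]])
```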

12.
Article in English | MEDLINE | ID: mdl-37018600

ABSTRACT

Semantic segmentation models gain robustness against adverse illumination conditions by taking advantage of complementary information from visible and thermal infrared (RGB-T) images. Despite its importance, most existing RGB-T semantic segmentation models directly adopt primitive fusion strategies, such as elementwise summation, to integrate multimodal features. Such strategies, unfortunately, overlook the modality discrepancies caused by inconsistent unimodal features obtained by two independent feature extractors, thus hindering the exploitation of cross-modal complementary information within the multimodal data. To address this, we propose a novel network for RGB-T semantic segmentation, i.e., MDRNet+, an improved version of our previous work ABMDRNet. The core of MDRNet+ is a new idea, termed bridging-then-fusing, which mitigates modality discrepancies before cross-modal feature fusion. Concretely, an improved Modality Discrepancy Reduction (MDR+) subnetwork is designed, which first extracts unimodal features and reduces their modality discrepancies. Afterward, discriminative multimodal features for RGB-T semantic segmentation are adaptively selected and integrated via several channel-weighted fusion (CWF) modules. Furthermore, a multiscale spatial context (MSC) module and a multiscale channel context (MCC) module are presented to effectively capture contextual information. Finally, we elaborately assemble a challenging RGB-T semantic segmentation dataset, i.e., RTSS, for urban scene understanding, to mitigate the lack of well-annotated training data. Comprehensive experiments demonstrate that our proposed model remarkably surpasses other state-of-the-art models on the MFNet, PST900, and RTSS datasets.
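
A minimal sketch of a channel-weighted fusion step in the spirit of the CWF modules is given below: per-channel weights are derived from pooled multimodal features and used to blend the RGB and thermal maps. The gating design is an illustrative assumption, not the paper's exact module.

```python
# Toy channel-weighted fusion: a gate computed from globally pooled features
# decides, per channel, how much to take from the RGB versus thermal branch.
import torch
import torch.nn as nn

class ChannelWeightedFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels), nn.ReLU(),
            nn.Linear(channels, channels), nn.Sigmoid()
        )

    def forward(self, rgb, thermal):                 # both: (B, C, H, W)
        pooled = torch.cat([rgb.mean(dim=(2, 3)), thermal.mean(dim=(2, 3))], dim=1)
        w = self.gate(pooled).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1) weights
        return w * rgb + (1 - w) * thermal           # channel-wise weighted blend

fused = ChannelWeightedFusion()(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```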

13.
IEEE Trans Cybern ; 53(6): 3794-3805, 2023 Jun.
Article in English | MEDLINE | ID: mdl-35468070

ABSTRACT

Zero-shot learning (ZSL) aims to classify unseen samples based on the relationship between learned visual features and semantic features. Traditional ZSL methods typically capture the underlying multimodal data structures by learning an embedding function between the visual space and the semantic space with the Euclidean metric. However, these models suffer from the hubness problem and the domain bias problem, which leads to unsatisfactory performance, especially in the generalized ZSL (GZSL) task. To tackle these problems, we formulate a discriminative cross-aligned variational autoencoder (DCA-VAE) for ZSL. The proposed model effectively utilizes a modified cross-modal-alignment variational autoencoder (VAE) to transform both the visual features and the semantic features obtained by the discriminative cosine metric into latent features. The key to our method is that we collect principal discriminative information from visual and semantic features to construct latent features that contain the discriminative multimodal information associated with unseen samples. Finally, the proposed DCA-VAE is validated on six benchmarks, including the large-scale ImageNet dataset, and the experimental results demonstrate the superiority of DCA-VAE over most existing embedding-based or generative ZSL models on the standard ZSL task and the more realistic GZSL task.

14.
IEEE Trans Neural Netw Learn Syst ; 34(6): 3183-3194, 2023 Jun.
Article in English | MEDLINE | ID: mdl-34587096

ABSTRACT

In this article, we present a conceptually simple but effective framework called knowledge distillation classifier generation network (KDCGN) for zero-shot learning (ZSL), where the learning agent must recognize unseen classes that have no visual data for training. Different from existing generative approaches that synthesize visual features for learning unseen-class classifiers, the proposed framework directly generates classifiers for unseen classes conditioned on the corresponding class-level semantics. To ensure that the generated classifiers are discriminative with respect to the visual features, we borrow the knowledge distillation idea to supervise the classifier generation and to distill the knowledge using, respectively, the visual classifiers and the soft targets trained from a traditional classification network. Under this framework, we develop two strategies, i.e., class augmentation and semantics guidance, to facilitate the supervision process by improving the visual classifiers. Specifically, the class augmentation strategy incorporates additional categories to train the visual classifiers, which regularizes the visual classifier weights to be compact, and under this supervision the generated classifiers become more discriminative. The semantics-guidance strategy encodes the class semantics into the visual classifiers, which facilitates the supervision process by minimizing the differences between the generated and the real visual classifiers. To evaluate the effectiveness of the proposed framework, we have conducted extensive experiments on five image classification datasets, i.e., AwA1, AwA2, CUB, FLO, and APY. Experimental results show that the proposed approach performs best in the traditional ZSL task and achieves a significant performance improvement on four of the five datasets in the generalized ZSL task.

15.
IEEE Trans Neural Netw Learn Syst ; 34(5): 2425-2439, 2023 May.
Article in English | MEDLINE | ID: mdl-34695000

ABSTRACT

Accurate object detection requires correct classification and high-quality localization. Currently, most single-shot detectors (SSDs) conduct simultaneous classification and regression using a fully convolutional network. Despite its high efficiency, this structure has some inappropriate designs for accurate object detection. The first is the mismatch of bounding box classification, where the classification results of the default bounding boxes are improperly treated as the results of the regressed bounding boxes during inference. The second is that a single regression step is not sufficient for high-quality object localization. To solve the classification mismatch, we propose a novel reg-offset-cls (ROC) module comprising three hierarchical steps: regression of the default bounding box, prediction of new feature sampling locations, and classification of the regressed bounding box with more accurate features. For high-quality localization, we stack two ROC modules, feeding the output of the first module into the second. In addition, we insert a feature enhanced (FE) module between the two stacked ROC modules to extract more contextual information. Experiments on three different datasets (i.e., MS COCO, PASCAL VOC, and UAVDT) demonstrate the effectiveness and superiority of our method. Without any bells or whistles, our proposed method outperforms state-of-the-art one-stage methods at real-time speed. The source code is available at https://github.com/JialeCao001/HSD.

16.
IEEE Trans Med Imaging ; 42(3): 594-605, 2023 03.
Article in English | MEDLINE | ID: mdl-36219664

ABSTRACT

Deep learning-based semi-supervised learning (SSL) algorithms are promising for reducing clinicians' manual annotation cost by using unlabelled data when developing medical image segmentation tools. However, to date, most existing SSL algorithms treat the labelled and unlabelled images separately and ignore the explicit connection between them; this disregards essential shared information and thus hinders further performance improvements. To mine the shared information between the labelled and unlabelled images, we introduce a class-specific representation extraction approach, in which a task-affinity module is specifically designed for representation extraction. We further cast the representation into two different views of feature maps: one focuses on low-level context, while the other concentrates on structural information. The two views of feature maps are incorporated into the task-affinity module, which then extracts the class-specific representations to aid knowledge transfer from the labelled images to the unlabelled images. In particular, a task-affinity consistency loss between the labelled and unlabelled images, based on the multi-scale class-specific representations, is formulated, leading to a significant performance improvement. Experimental results on three datasets show that our method consistently outperforms existing state-of-the-art methods. Our findings highlight the potential of class-specific representation consistency for semi-supervised medical image segmentation. The code and models are to be made publicly available at https://github.com/jingkunchen/TAC.


Subjects
Algorithms, Supervised Machine Learning
17.
Gen Psychiatr ; 36(6): e101304, 2023.
Article in English | MEDLINE | ID: mdl-38169807

ABSTRACT

Background: Individual differences have been detected in individuals with opioid use disorders (OUD) in rehabilitation following protracted abstinence. Recent studies suggest that prediction models based on neuroimaging data are effective for individual-level prognosis in substance use disorders (SUD). Aims: This prospective cohort study aimed to identify neuroimaging biomarkers of individual response to protracted abstinence in opioid users using connectome-based predictive modelling (CPM). Methods: One hundred and eight inpatients with OUD underwent structural and functional magnetic resonance imaging (fMRI) scans at baseline. The Heroin Craving Questionnaire (HCQ) was used to assess craving levels at baseline and at the 8-month follow-up of abstinence. CPM with leave-one-out cross-validation was used to identify baseline networks that could predict follow-up HCQ scores and changes in HCQ (HCQ at follow-up minus HCQ at baseline). The predictive ability of the identified networks was then tested in a separate, heterogeneous sample of individuals with methamphetamine use disorder who underwent MRI scanning before abstinence. Results: CPM could predict craving changes induced by long-term abstinence, as shown by significant correlations between predicted and actual follow-up HCQ scores (r=0.417, p<0.001) and changes in HCQ (negative: r=0.334, p=0.002; positive: r=0.233, p=0.038). The identified craving-related prediction networks included the somato-motor network (SMN), salience network (SALN), default mode network (DMN), medial frontal network, visual network and auditory network. In addition, decreased connectivity of frontal-parietal network (FPN)-SMN, FPN-DMN and FPN-SALN and increased connectivity of subcortical network (SCN)-DMN, SCN-SALN and SCN-SMN were positively correlated with craving levels. Conclusions: These findings highlight the potential of CPM to predict individual craving levels after protracted abstinence, as well as its generalisation ability; the identified brain networks might be the focus of innovative therapies in the future.
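
For reference, the sketch below follows the standard CPM protocol with leave-one-out cross-validation: select edges correlated with the behavioural score in the training folds, summarise each subject by the summed strength of the selected edges, fit a linear model, and predict the held-out subject. The threshold and the single positive-edge summary are simplifications of the usual positive/negative-network variant, and the data here are synthetic.

```python
# Compact connectome-based predictive modelling (CPM) with leave-one-out CV.
import numpy as np

def cpm_loocv(edges, scores, r_thresh=0.2):
    """edges: (n_subjects, n_edges) connectivity; scores: (n_subjects,) behaviour."""
    n = len(scores)
    preds = np.zeros(n)
    for i in range(n):
        train = np.delete(np.arange(n), i)
        X, y = edges[train], scores[train]
        # correlation of every edge with behaviour in the training subjects
        Xz = (X - X.mean(0)) / (X.std(0) + 1e-8)
        yz = (y - y.mean()) / (y.std() + 1e-8)
        r = Xz.T @ yz / len(y)
        mask = r > r_thresh                      # positively correlated edges only
        if not mask.any():                       # fall back to the single best edge
            mask = r == r.max()
        strength = X[:, mask].sum(1)             # one summary feature per subject
        slope, intercept = np.polyfit(strength, y, 1)
        preds[i] = slope * edges[i, mask].sum() + intercept
    return preds

edges = np.random.rand(30, 500)                       # synthetic connectomes
scores = edges[:, :10].sum(1) + np.random.randn(30) * 0.1
predicted = cpm_loocv(edges, scores)
```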

18.
IEEE Trans Image Process ; 31: 6621-6634, 2022.
Article in English | MEDLINE | ID: mdl-36256711

ABSTRACT

Most existing RGB-D salient object detection (SOD) models adopt a two-stream structure to extract information from the input RGB and depth images. Since they use two subnetworks for unimodal feature extraction and multiple multi-modal feature fusion modules for extracting cross-modal complementary information, these models require a huge number of parameters, which hinders their real-life applications. To remedy this situation, we propose a novel middle-level feature fusion structure that allows us to design a lightweight RGB-D SOD model. Specifically, the proposed structure first employs two shallow subnetworks to extract low- and middle-level unimodal RGB and depth features, respectively. Afterward, instead of integrating the middle-level unimodal features multiple times at different layers, we fuse them only once via a specially designed fusion module. On top of that, high-level multi-modal semantic features are further extracted for final salient object detection via an additional subnetwork. This greatly reduces the network's parameters. Moreover, to compensate for the performance loss due to the parameter reduction, a relation-aware multi-modal feature fusion module is specially designed to effectively capture the cross-modal complementary information during the fusion of middle-level multi-modal features. By enabling the feature-level and decision-level information to interact, we maximize the usage of the fused cross-modal middle-level features and the extracted cross-modal high-level features for saliency prediction. Experimental results on several benchmark datasets verify the effectiveness and superiority of the proposed method over some state-of-the-art methods. Remarkably, our proposed model has only 3.9M parameters and runs at 33 FPS.

19.
IEEE Trans Image Process ; 31: 6719-6732, 2022.
Article in English | MEDLINE | ID: mdl-36282823

ABSTRACT

Recently, Part-Object Relational (POR) saliency underpinned by the Capsule Network (CapsNet) has been demonstrated to be an effective modeling mechanism for improving saliency detection accuracy. However, it is widely known that current capsule routing operations have huge computational complexity, which seriously limits the usability of POR saliency models in real-time applications. To this end, this paper takes an early step towards fast POR saliency inference by proposing a novel disentangled part-object relational network (DPORTNet). Concretely, we disentangle horizontal routing and vertical routing from the original omnidirectional capsule routing, yielding Disentangled Capsule Routing (DCR). This mechanism enjoys two advantages. On one hand, DCR, which disentangles orthogonal 1D (i.e., vertical and horizontal) routing, greatly reduces parameters and routing complexity, resulting in much faster inference than the omnidirectional 2D routing adopted by existing CapsNets. On the other hand, thanks to the lightweight POR cues explored by DCR, we can conveniently integrate the part-object routing process into different feature layers of the CNN, rather than applying it only to the small-scale layer as in previous works, which helps to increase saliency inference accuracy. Compared to previous POR saliency detectors, DPORTNet infers visual saliency 5 to 9 times faster and is more accurate. DPORTNet is available under the open-source license at https://github.com/liuyi1989/DCR.

20.
Article in English | MEDLINE | ID: mdl-36070273

ABSTRACT

The performance of zero-shot learning (ZSL) can be improved progressively by learning better features and generating pseudo-samples for unseen classes. Existing ZSL works typically learn feature extractors and generators independently, which may shift the unseen samples away from their real distribution and thus suffer from the domain bias problem. In this article, to tackle this challenge, we propose a variational autoencoder (VAE)-based framework, namely joint Attentive Region Embedding with Enhanced Semantics (AREES), which is tailored to advance zero-shot recognition. Specifically, AREES is end-to-end trainable and consists of three network branches: 1) attentive region embedding is used to learn semantic-guided visual features with an attention mechanism; 2) a decomposition structure and a semantic pivot regularization are used to extract enhanced semantics; and 3) a multimodal VAE (mVAE) with a cross-reconstruction loss and a distribution alignment loss is used to obtain a shared latent embedding space of visual features and semantics. Finally, feature extraction and feature generation are optimized jointly in AREES, which addresses the domain shift problem to a large extent. Comprehensive evaluations on six benchmarks, including ImageNet, demonstrate the superiority of the proposed model over its state-of-the-art counterparts.
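
The sketch below illustrates the two losses named above for a two-branch VAE: a cross-reconstruction term (each modality decoded from the other modality's latent code) and a distribution-alignment term between the two latent Gaussians. The network sizes and the exact alignment form are illustrative assumptions, not the AREES architecture.

```python
# Two-branch VAE sketch with cross-reconstruction and latent alignment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    def __init__(self, in_dim, latent=64):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent)   # outputs mean and log-variance
        self.dec = nn.Linear(latent, in_dim)

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
        return z, mu, logvar

vis, sem = Branch(2048), Branch(312)
x_vis, x_sem = torch.randn(16, 2048), torch.randn(16, 312)

z_v, mu_v, lv_v = vis.encode(x_vis)
z_s, mu_s, lv_s = sem.encode(x_sem)

# cross-reconstruction: each decoder reconstructs its modality from the other latent
cross_recon = F.mse_loss(vis.dec(z_s), x_vis) + F.mse_loss(sem.dec(z_v), x_sem)
# distribution alignment: pull the two latent Gaussians together
align = ((mu_v - mu_s) ** 2).sum(1).mean() + \
        (((0.5 * lv_v).exp() - (0.5 * lv_s).exp()) ** 2).sum(1).mean()
loss = cross_recon + align
loss.backward()
```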
