Results 1 - 20 of 29
1.
IEEE Trans Med Imaging ; PP, 2024 Jul 16.
Article in English | MEDLINE | ID: mdl-39012729

ABSTRACT

Existing deep learning methods have achieved remarkable results in diagnosing retinal diseases, showcasing the potential of advanced AI in ophthalmology. However, the black-box nature of these methods obscures the decision-making process, compromising their trustworthiness and acceptability. Inspired by concept-based approaches and recognizing the intrinsic correlation between retinal lesions and diseases, we regard retinal lesions as concepts and propose an inherently interpretable framework designed to enhance both the performance and explainability of diagnostic models. Leveraging the transformer architecture, known for its proficiency in capturing long-range dependencies, our model can effectively identify lesion features. By integrating image-level annotations, it aligns lesion concepts with human cognition under the guidance of a retinal foundation model. Furthermore, to attain interpretability without losing lesion-specific information, our method employs a classifier built on a cross-attention mechanism for disease diagnosis and explanation, where explanations are grounded in the contributions of human-understandable lesion concepts and their visual localization. Notably, due to the structure and inherent interpretability of our model, clinicians can implement concept-level interventions to correct diagnostic errors by simply adjusting erroneous lesion predictions. Experiments conducted on four fundus image datasets demonstrate that our method achieves favorable performance against state-of-the-art methods while providing faithful explanations and enabling concept-level interventions. Our code is publicly available at https://github.com/Sorades/CLAT.
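As a rough illustration of the cross-attention diagnosis head described above, the following PyTorch sketch (module names and sizes are assumptions, not the authors' implementation) shows how learnable disease queries can attend over lesion-concept tokens; the attention weights expose per-concept contributions, and a concept-level intervention would amount to editing a concept token before this head.

```python
import torch
import torch.nn as nn

class ConceptCrossAttentionClassifier(nn.Module):
    """Hypothetical classifier: disease queries cross-attend over lesion-concept tokens."""
    def __init__(self, dim=256, num_diseases=4):
        super().__init__()
        self.disease_queries = nn.Parameter(torch.randn(num_diseases, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, concept_tokens):
        # concept_tokens: (B, num_concepts, dim) lesion-concept features
        q = self.disease_queries.unsqueeze(0).expand(concept_tokens.size(0), -1, -1)
        fused, attn_w = self.attn(q, concept_tokens, concept_tokens)
        logits = self.score(fused).squeeze(-1)   # (B, num_diseases) disease scores
        return logits, attn_w                    # attn_w: per-disease weights over concepts (the explanation)
```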

2.
Article in English | MEDLINE | ID: mdl-38917282

ABSTRACT

Federated learning has emerged as a promising paradigm for privacy-preserving collaboration among different parties. Recently, with the growing popularity of federated learning, an influx of approaches has been developed to address different realistic challenges. In this survey, we provide a systematic overview of important recent developments in federated learning research. Firstly, we introduce the study history and terminology definitions of this area. Then, we comprehensively review three basic lines of research: generalization, robustness, and fairness, by introducing their respective background concepts, task settings, and main challenges. We also offer a detailed overview of representative literature on both methods and datasets. We further benchmark the reviewed methods on several well-known datasets. Finally, we point out several open issues in this field and suggest opportunities for further research. We also provide a public website to continuously track developments in this fast-advancing field: https://github.com/WenkeHuang/MarsFL.
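For readers new to the paradigm the survey covers, the minimal sketch below (illustrative only; not taken from the survey or its benchmark code) shows the federated averaging step underlying most reviewed methods: clients train locally on private data, and the server aggregates their weights in proportion to local dataset size.

```python
import torch

def federated_average(client_states, client_sizes):
    """Weighted average of client state_dicts (float tensors assumed) by local dataset size."""
    total = float(sum(client_sizes))
    return {
        key: sum(state[key].float() * (n / total) for state, n in zip(client_states, client_sizes))
        for key in client_states[0]
    }

# usage sketch: global_model.load_state_dict(federated_average(states, sizes))
```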

3.
Article in English | MEDLINE | ID: mdl-38691434

ABSTRACT

This article studies an emerging practical problem called heterogeneous prototype learning (HPL). Unlike the conventional heterogeneous face synthesis (HFS) problem, which focuses on precisely translating a face image from a source domain to a target domain without removing facial variations, HPL aims at learning the variation-free prototype of an image in the target domain while preserving the identity characteristics. HPL is a compounded problem involving two cross-coupled subproblems, that is, domain transfer and prototype learning (PL), thus making most of the existing HFS methods, which simply transfer the domain style of images, unsuitable for HPL. To tackle HPL, we advocate disentangling the prototype and domain factors in their respective latent feature spaces and then replacing the source domain with the target one for generating a new heterogeneous prototype. In doing so, the two subproblems in HPL can be solved jointly in a unified manner. Based on this, we propose a disentangled HPL framework, dubbed DisHPL, which is composed of one encoder-decoder generator and two discriminators. The generator and discriminators play adversarial games such that the generator embeds contaminated images into a prototype feature space capturing only identity information and a domain-specific feature space, while generating realistic-looking heterogeneous prototypes. Experiments on various heterogeneous datasets with diverse variations validate the superiority of DisHPL.

4.
IEEE Trans Image Process ; 33: 2627-2638, 2024.
Article in English | MEDLINE | ID: mdl-38536683

ABSTRACT

Visual intention understanding is a challenging task that explores the hidden intention behind images posted by publishers on social media. Visual intention represents implicit semantics, whose ambiguous definition inevitably leads to label shifting and label blemish. The former indicates that the same image delivers intention discrepancies under different data augmentations, while the latter means that the labels of intention data are susceptible to errors or omissions during the annotation process. This paper proposes a novel method, called Label-aware Calibration and Relation-preserving (LabCR), to alleviate the above two problems from both intra-sample and inter-sample views. First, we disentangle multiple intentions into single intentions for explicit distribution calibration at both the overall and individual levels. Calibrating the class probability distributions in augmented instance pairs provides consistent inferred intentions to address label shifting. Second, we utilize intention similarity to establish correlations among samples, which offers additional supervision signals to form correlation alignments in instance pairs. This strategy alleviates the effect of label blemish. Extensive experiments have validated the superiority of the proposed method LabCR in visual intention understanding and pedestrian attribute recognition. Code is available at https://github.com/ShiQingHongYa/LabCR.
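A minimal sketch of the intra-sample consistency idea, assuming a plain symmetric-KL term between the class distributions of two augmented views of the same image; LabCR's actual calibration further distinguishes overall and individual distributions, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_view1, logits_view2):
    """Symmetric KL between the predicted class distributions of two augmented views."""
    log_p1 = F.log_softmax(logits_view1, dim=1)
    log_p2 = F.log_softmax(logits_view2, dim=1)
    kl_12 = F.kl_div(log_p1, log_p2.exp(), reduction="batchmean")  # KL(p2 || p1)
    kl_21 = F.kl_div(log_p2, log_p1.exp(), reduction="batchmean")  # KL(p1 || p2)
    return 0.5 * (kl_12 + kl_21)
```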

5.
Ophthalmol Ther ; 13(5): 1125-1144, 2024 May.
Article in English | MEDLINE | ID: mdl-38416330

ABSTRACT

INTRODUCTION: Inaccurate and untimely diagnoses of fundus diseases lead to vision-threatening complications and even blindness. We built a deep learning platform (DLP) for automatic detection of 30 fundus diseases using ultra-widefield fluorescein angiography (UWFFA) with deep expert aggregation. METHODS: This retrospective and cross-sectional database study included a total of 61,609 UWFFA images dating from 2016 to 2021, involving more than 3364 subjects across multiple centers in China. All subjects were divided into 30 different groups. The state-of-the-art convolutional neural network architecture ConvNeXt was chosen as the backbone to train the proposed system and to evaluate its receiver operating characteristic (ROC) curve on the test data and the external test data. We compared the classification performance of the proposed system with that of ophthalmologists, including two retinal specialists. RESULTS: We built a DLP to analyze UWFFA that can detect up to 30 fundus diseases, with a frequency-weighted average area under the receiver operating characteristic curve (AUC) of 0.940 on the primary test dataset and 0.954 on the external multi-hospital test dataset. The tool shows accuracy comparable to that of retina specialists in diagnosis and evaluation. CONCLUSIONS: This is the first study on a large-scale UWFFA dataset for multi-retinal-disease classification. We believe that our UWFFA DLP advances artificial intelligence (AI)-based diagnosis of various retinal diseases and would contribute to labor saving and precision medicine, especially in remote areas.
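A hedged sketch of how the reported frequency-weighted average AUC can be computed, assuming one-vs-rest per-class AUCs weighted by class frequency (the study's exact weighting may differ); scikit-learn's `roc_auc_score(..., average='weighted')` implements the same idea for multilabel inputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frequency_weighted_auc(y_true_onehot, y_score):
    """y_true_onehot: (N, C) binary labels; y_score: (N, C) predicted probabilities."""
    class_counts = y_true_onehot.sum(axis=0)
    weights = class_counts / class_counts.sum()          # class frequencies as weights
    per_class_auc = np.array([
        roc_auc_score(y_true_onehot[:, c], y_score[:, c])
        for c in range(y_true_onehot.shape[1])
    ])
    return float(np.dot(weights, per_class_auc))
```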

6.
IEEE Trans Pattern Anal Mach Intell ; 46(4): 2299-2315, 2024 Apr.
Article in English | MEDLINE | ID: mdl-37966933

ABSTRACT

This paper introduces a simple yet powerful channel augmentation for visible-infrared re-identification. Most existing augmentation operations designed for single-modality visible images do not fully consider the imagery properties in visible to infrared matching. Our basic idea is to homogeneously generate color-irrelevant images by randomly exchanging the color channels. It can be seamlessly integrated into existing augmentation operations, consistently improving the robustness against color variations. For cross-modality metric learning, we design an enhanced channel-mixed learning strategy to simultaneously handle the intra- and cross-modality variations with squared difference for stronger discriminability. Besides, a weak-and-strong augmentation joint learning strategy is further developed to explicitly optimize the outputs of augmented images, which mutually integrates the channel augmented images (strong) and the general augmentation operations (weak) with consistency regularization. Furthermore, by conducting the label association between the channel augmented images and infrared modalities with modality-specific clustering, a simple yet effective unsupervised learning baseline is designed, which significantly outperforms existing unsupervised single-modality solutions. Extensive experiments with insightful analysis on two visible-infrared recognition tasks show that the proposed strategies consistently improve the accuracy. Without auxiliary information, the Rank-1/mAP achieves 71.48%/68.15% on the large-scale SYSU-MM01 dataset.
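A minimal sketch of the channel augmentation described above: randomly permuting the RGB channels of a visible image produces a color-irrelevant view that can be mixed into an existing augmentation pipeline with some probability. The function name and usage are assumptions; the released code may differ.

```python
import torch

def random_channel_exchange(img: torch.Tensor) -> torch.Tensor:
    """Randomly permute the color channels of a (3, H, W) image tensor."""
    perm = torch.randperm(3)       # e.g. (2, 0, 1): structure is kept, color is scrambled
    return img[perm]

img = torch.rand(3, 256, 128)      # dummy visible pedestrian image
color_irrelevant_view = random_channel_exchange(img)
```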

7.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 2950-2964, 2024 May.
Article in English | MEDLINE | ID: mdl-38010930

ABSTRACT

Matching hand-drawn sketches with photos (a.k.a. sketch-photo recognition or re-identification) faces the information asymmetry challenge due to the abstract nature of the sketch modality. Existing works tend to learn shared embedding spaces with CNN models by discarding the appearance cues for photo images or introducing a GAN for sketch-photo synthesis. The former unavoidably loses discriminability, while the latter contains ineffaceable generation noise. In this paper, we make the first attempt to design an information-aligned sketch transformer (SketchTrans+) via cross-modal disentangled prototype learning, given that the transformer has shown great promise for discriminative visual modelling. Specifically, we design an asymmetric disentanglement scheme with a dynamically updatable auxiliary sketch (A-sketch) to align the modality representations without sacrificing information. The asymmetric disentanglement decomposes the photo representations into sketch-relevant and sketch-irrelevant cues, transferring sketch-irrelevant knowledge into the sketch modality to compensate for the missing information. Moreover, considering the feature discrepancy between the two modalities, we present a modality-aware prototype contrastive learning method that mines representative modality-sharing information using the modality-aware prototypes rather than the original feature representations. Extensive experiments on category- and instance-level sketch-based datasets validate the superiority of our proposed method under various metrics.

8.
Article in English | MEDLINE | ID: mdl-37878434

ABSTRACT

Federated learning is an important privacy-preserving multi-party learning paradigm, involving collaborative learning with others and local updating on private data. Model heterogeneity and catastrophic forgetting are two crucial challenges, which greatly limit applicability and generalizability. This paper presents FCCL+, a novel federated correlation and similarity learning framework with non-target distillation, facilitating both intra-domain discriminability and inter-domain generalization. For the heterogeneity issue, we leverage irrelevant unlabeled public data for communication between the heterogeneous participants. We construct a cross-correlation matrix and align instance similarity distributions at both the logit and feature levels, which effectively overcomes the communication barrier and improves generalization ability. For catastrophic forgetting in the local updating stage, FCCL+ introduces Federated Non-Target Distillation, which retains inter-domain knowledge while avoiding the optimization conflict issue, fully distilling privileged inter-domain information by depicting posterior class relations. Considering that there is no standard benchmark for evaluating existing heterogeneous federated learning methods under the same setting, we present a comprehensive benchmark with extensive representative methods under four domain shift scenarios, supporting both heterogeneous and homogeneous federated settings. Empirical results demonstrate the superiority of our method and the efficiency of its modules in various scenarios. The benchmark code for reproducing our results is available at https://github.com/WenkeHuang/FCCL.
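A hedged sketch of the cross-correlation alignment on logits, computed between two participants' outputs on the same public batch and pushed toward the identity matrix; FCCL+ additionally aligns instance-similarity distributions, works at the feature level, and adds non-target distillation, none of which are shown here.

```python
import torch

def cross_correlation_loss(z_a, z_b, lambda_offdiag=0.005):
    """z_a, z_b: (N, D) logits of two participants on the shared unlabeled public batch."""
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a.t() @ z_b) / z_a.size(0)                            # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # diagonal pulled toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # off-diagonal pulled toward 0
    return on_diag + lambda_offdiag * off_diag
```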

9.
Article in English | MEDLINE | ID: mdl-37883256

ABSTRACT

Automatic lesion segmentation is important for assisting doctors in the diagnostic process. Recent deep learning approaches heavily rely on large-scale datasets, which are difficult to obtain in many clinical applications. Leveraging external labelled datasets is an effective solution to tackle the problem of insufficient training data. In this paper, we propose a new framework, namely LatenTrans, to utilize existing datasets for boosting the performance of lesion segmentation in extremely low data regimes. LatenTrans translates non-target lesions into target-like lesions and expands the training dataset with target-like data for better performance. Images are first projected to the latent space via aligned style-based generative models, and rich lesion semantics are encoded using the latent codes. A novel consistency-aware latent code manipulation module is proposed to enable high-quality local style transfer from non-target lesions to target-like lesions while preserving other parts. Moreover, we propose a new metric, Normalized Latent Distance, to address the question of which of the various existing datasets is adequate for knowledge transfer. Extensive experiments are conducted on segmenting lung and brain lesions, and the experimental results demonstrate that our proposed LatenTrans is superior to existing methods for cross-disease lesion segmentation.

10.
IEEE Trans Image Process ; 32: 5099-5113, 2023.
Article in English | MEDLINE | ID: mdl-37669187

ABSTRACT

Daytime visible modality (RGB) and night-time infrared (IR) modality person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem. However, training a cross-modality ReID model requires plenty of cross-modality (visible-infrared) identity labels, which are more expensive than those for single-modality person ReID. To alleviate this issue, this paper studies the unsupervised domain adaptive visible-infrared person re-identification (UDA-VI-ReID) task without relying on any cross-modality annotation. To transfer learned knowledge from the labelled visible source domain to the unlabelled visible-infrared target domain, we propose a Translation, Association and Augmentation (TAA) framework. Specifically, the modality translator is first utilized to translate visible images into infrared images, forming generated visible-infrared image pairs for cross-modality supervised training. A Robust Association and Mutual Learning (RAML) module is then designed to exploit the underlying relations between the visible and infrared modalities for label noise modeling. Moreover, a Translation Supervision and Feature Augmentation (TSFA) module is designed to enhance the discriminability by enriching the supervision with feature augmentation and modality translation. The extensive experimental results demonstrate that our method significantly outperforms current state-of-the-art unsupervised methods under various settings, and even surpasses some supervised counterparts, providing a powerful baseline for UDA-VI-ReID.

11.
IEEE Trans Image Process ; 32: 5075-5086, 2023.
Article in English | MEDLINE | ID: mdl-37669190

ABSTRACT

For the long-term person re-identification (ReID) task, pedestrians are likely to change clothes, which poses a key challenge: overcoming the drastic appearance variations caused by these cloth changes. However, analyzing how cloth changes influence identity-invariant representation learning is difficult. In this context, varying cloth-changed samples are not adaptively utilized, and their effects on the resulting features are overshadowed. To address these limitations, this paper aims to estimate the effect of cloth-changing patterns at both the image and feature levels, presenting a Dual-Level Adaptive Weighting (DLAW) solution. Specifically, at the image level, we propose an adaptive mining strategy to locate the cloth-changed regions for each identity. This strategy highlights the informative areas that have undergone changes, enhancing robustness against cloth variations. At the feature level, we estimate the degree of cloth-changing by modeling the correlation of part-level features and re-weighting identity-invariant feature components. This further eliminates the effects of cloth variations at the semantic body part level. Extensive experiments demonstrate that our method achieves promising performance on several cloth-changing datasets. Code and models are available at https://github.com/fountaindream/DLAW.

12.
IEEE Trans Image Process ; 32: 4543-4554, 2023.
Article in English | MEDLINE | ID: mdl-37531308

ABSTRACT

Composing Text and Image to Image Retrieval (CTI-IR) aims at finding the target image that matches the query image visually and the query text semantically. However, existing works ignore the fact that the reference text usually serves multiple functions, e.g., modification and auxiliary. To address this issue, we put forth a unified solution, namely the Hierarchical Aggregation Transformer incorporated with a Cross Relation Network (CRN). CRN unifies the modification and relevance manners in a single framework. This configuration shows broader applicability, enabling us to model both modification and auxiliary text, or their combination, in triplet relationships simultaneously. Specifically, CRN includes: 1) a Cross Relation Network that comprehensively captures the relationships of the various composed retrieval scenarios caused by the two different query text types, allowing a unified retrieval model to designate adaptive combination strategies for flexible applicability; and 2) a Hierarchical Aggregation Transformer that aggregates top-down features with a Multi-layer Perceptron (MLP) to overcome the limitations of edge information loss in a window-based multi-stage Transformer. Extensive experiments demonstrate the superiority of the proposed CRN on all three fashion-domain datasets. Code is available at github.com/yan9qu/crn.

13.
IEEE Trans Image Process ; 32: 2190-2201, 2023.
Article in English | MEDLINE | ID: mdl-37018096

ABSTRACT

Visual intention understanding is the task of exploring the potential and underlying meaning expressed in images. Simply modeling the objects or backgrounds within the image content leads to unavoidable comprehension bias. To alleviate this problem, this paper proposes a Cross-modality Pyramid Alignment with Dynamic optimization (CPAD) to enhance the global understanding of visual intention with hierarchical modeling. The core idea is to exploit the hierarchical relationship between visual content and textual intention labels. For visual hierarchy, we formulate the visual intention understanding task as a hierarchical classification problem, capturing multiple granular features in different layers, which corresponds to hierarchical intention labels. For textual hierarchy, we directly extract the semantic representation from intention labels at different levels, which supplements the visual content modeling without extra manual annotations. Moreover, to further narrow the domain gap between different modalities, a cross-modality pyramid alignment module is designed to dynamically optimize the performance of visual intention understanding in a joint learning manner. Comprehensive experiments intuitively demonstrate the superiority of our proposed method, outperforming existing visual intention understanding methods.

14.
Article in English | MEDLINE | ID: mdl-37022854

ABSTRACT

This article presents a new adaptive metric distillation approach that can significantly improve the student networks' backbone features, along with better classification results. Previous knowledge distillation (KD) methods usually focus on transferring the knowledge across the classifier logits or feature structure, ignoring the abundant sample relations in the feature space. We demonstrated that such a design greatly limits performance, especially for the retrieval task. The proposed collaborative adaptive metric distillation (CAMD) has three main advantages: 1) it focuses the optimization on the relationships between key pairs by introducing a hard mining strategy into the distillation framework; 2) it provides an adaptive metric distillation that can explicitly optimize the student feature embeddings by applying the relations in the teacher embeddings as supervision; and 3) it employs a collaborative scheme for effective knowledge aggregation. Extensive experiments demonstrated that our approach sets a new state-of-the-art in both the classification and retrieval tasks, outperforming other cutting-edge distillers under various settings.
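A minimal sketch of relation-based metric distillation under the assumption that the teacher's pairwise similarities supervise the student's pairwise similarities; CAMD's hard-pair mining and collaborative aggregation are not reproduced here.

```python
import torch
import torch.nn.functional as F

def relation_distillation_loss(student_feat, teacher_feat):
    """student_feat, teacher_feat: (B, D) embeddings of the same batch."""
    s = F.normalize(student_feat, dim=1)
    t = F.normalize(teacher_feat, dim=1)
    sim_student = s @ s.t()                       # (B, B) student pairwise similarities
    sim_teacher = t @ t.t()                       # (B, B) teacher pairwise similarities as supervision
    return F.mse_loss(sim_student, sim_teacher.detach())
```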

15.
IEEE Trans Neural Netw Learn Syst ; 34(2): 867-881, 2023 Feb.
Article in English | MEDLINE | ID: mdl-34403349

ABSTRACT

Single sample per person face recognition (SSPP FR) is one of the most challenging problems in FR due to the extreme lack of enrolment data. To date, the most popular SSPP FR methods are the generic learning methods, which recognize query face images based on the so-called prototype plus variation (i.e., P+V) model. However, the classic P+V model suffers from two major limitations: 1) it linearly combines the prototype and variation images in the observational pixel-spatial space and cannot generalize to multiple nonlinear variations, e.g., poses, which are common in face images and 2) it would be severely impaired once the enrolment face images are contaminated by nuisance variations. To address the two limitations, it is desirable to disentangle the prototype and variation in a latent feature space and to manipulate the images in a semantic manner. To this end, we propose a novel disentangled prototype plus variation model, dubbed DisP+V, which consists of an encoder-decoder generator and two discriminators. The generator and discriminators play two adversarial games such that the generator nonlinearly encodes the images into a latent semantic space, where the more discriminative prototype feature and the less discriminative variation feature are disentangled. Meanwhile, the prototype and variation features can guide the generator to generate an identity-preserved prototype and the corresponding variation, respectively. Experiments on various real-world face datasets demonstrate the superiority of our DisP+V model over the classic P+V model for SSPP FR. Furthermore, DisP+V demonstrates its unique characteristics in both prototype recovery and face editing/interpolation.


Subject(s)
Algorithms; Neural Networks, Computer; Humans; Face; Pattern Recognition, Automated/methods
16.
IEEE Trans Med Imaging ; 42(3): 797-809, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36288236

ABSTRACT

Coronavirus disease 2019 (COVID-19) has become a severe global pandemic. Accurate pneumonia infection segmentation is important for assisting doctors in diagnosing COVID-19. Deep learning-based methods can be developed for automatic segmentation, but the lack of large-scale well-annotated COVID-19 training datasets may hinder their performance. Semi-supervised segmentation is a promising solution which explores large amounts of unlabelled data, while most existing methods focus on pseudo-label refinement. In this paper, we propose a new perspective on semi-supervised learning for COVID-19 pneumonia infection segmentation, namely pseudo-label guided image synthesis. The main idea is to keep the pseudo-labels and synthesize new images to match them. The synthetic image has the same COVID-19 infected regions as indicated in the pseudo-label, and the reference style extracted from the style code pool is added to make it more realistic. We introduce two representative methods by incorporating the synthetic images into model training, including single-stage Synthesis-Assisted Cross Pseudo Supervision (SA-CPS) and multi-stage Synthesis-Assisted Self-Training (SA-ST), which can work individually as well as cooperatively. Synthesis-assisted methods expand the training data with high-quality synthetic data, thus improving the segmentation performance. Extensive experiments on two COVID-19 CT datasets for segmenting the infections demonstrate our method is superior to existing schemes for semi-supervised segmentation, and achieves the state-of-the-art performance on both datasets. Code is available at: https://github.com/FeiLyu/SASSL.
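A hedged sketch of the cross pseudo supervision component on unlabelled CT slices, in which two segmentation networks train on each other's hard pseudo-labels; the synthesis-assisted part (generating images to match the pseudo-labels) is not shown.

```python
import torch
import torch.nn.functional as F

def cps_loss(logits_a, logits_b):
    """logits_a, logits_b: (B, C, H, W) outputs of two networks on the same unlabelled batch."""
    pseudo_a = logits_a.argmax(dim=1).detach()    # hard pseudo-label from network A
    pseudo_b = logits_b.argmax(dim=1).detach()    # hard pseudo-label from network B
    # each network is supervised by the other's pseudo-label
    return F.cross_entropy(logits_a, pseudo_b) + F.cross_entropy(logits_b, pseudo_a)
```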


Subject(s)
COVID-19; Pneumonia; Humans; COVID-19/diagnostic imaging; Pandemics; Supervised Machine Learning
17.
IEEE Trans Image Process ; 31: 4227-4239, 2022.
Article in English | MEDLINE | ID: mdl-35727784

ABSTRACT

This paper studies the challenging person re-identification (Re-ID) task under the cloth-changing scenario, where the same identity (ID) undergoes uncertain cloth changes. To learn cloth- and ID-invariant features, it is crucial to collect abundant training data with varying clothes, which is difficult in practice. To alleviate the reliance on rich data collection, we reinforce the feature learning process by designing powerful complementary data augmentation strategies, including positive and negative data augmentation. Specifically, the positive augmentation fills out the ID space by randomly patching the person images with different clothes, simulating rich appearances to enhance the robustness against clothes variations. For negative augmentation, the basic idea is to randomly generate out-of-distribution synthetic samples by combining various appearance and posture factors from real samples. The designed strategies seamlessly reinforce the feature learning without introducing additional information. Extensive experiments conducted on both cloth-changing and cloth-unchanging tasks demonstrate the superiority of our proposed method, consistently improving the accuracy over various baselines.
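A minimal sketch of the positive augmentation idea, assuming a simple policy of copying a random rectangle from another identity's image onto the current image to simulate a clothes change; the patch-size and location ranges are illustrative assumptions.

```python
import random
import torch

def random_clothes_patch(img, donor, min_frac=0.2, max_frac=0.5):
    """img, donor: (3, H, W) tensors of two different identities; returns a patched copy of img."""
    _, h, w = img.shape
    ph = int(h * random.uniform(min_frac, max_frac))   # patch height
    pw = int(w * random.uniform(min_frac, max_frac))   # patch width
    top, left = random.randint(0, h - ph), random.randint(0, w - pw)
    out = img.clone()
    out[:, top:top + ph, left:left + pw] = donor[:, top:top + ph, left:left + pw]
    return out
```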


Subject(s)
Posture; Data Collection; Humans; Uncertainty
18.
IEEE Trans Med Imaging ; 41(9): 2510-2520, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35404812

ABSTRACT

Automatic liver tumor segmentation could offer assistance to radiologists in liver tumor diagnosis, and its performance has been significantly improved by recent deep learning based methods. These methods rely on large-scale well-annotated training datasets, but collecting such datasets is time-consuming and labor-intensive, which could hinder their performance in practical situations. Learning from synthetic data is an encouraging solution to address this problem. In our task, synthetic tumors can be injected into healthy images to form training pairs. However, directly applying a model trained on the synthetic tumor images to real test images performs poorly due to the domain shift problem. In this paper, we propose a novel approach, namely Synthetic-to-Real Test-Time Training (SR-TTT), to reduce the domain gap between synthetic training images and real test images. Specifically, we add a self-supervised auxiliary task, i.e., two-step reconstruction, which takes the output of the main segmentation task as its input to build an explicit connection between these two tasks. Moreover, we design a scheduled mixture strategy to avoid error accumulation and bias explosion in the training process. During test time, we adapt the segmentation model to each test image with self-supervision from the auxiliary task so as to improve the inference performance. The proposed method is extensively evaluated on two public datasets for liver tumor segmentation. The experimental results demonstrate that our proposed SR-TTT can effectively mitigate the synthetic-to-real domain shift problem in the liver tumor segmentation task, and is superior to existing state-of-the-art approaches.
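A hedged sketch of test-time adaptation with a self-supervised auxiliary task: for each test image the network takes a few gradient steps on an auxiliary reconstruction loss that consumes the segmentation output, then predicts. The aux_head module, L1 loss, and per-image loop are assumptions; SR-TTT's two-step reconstruction and scheduled mixture strategy are not reproduced.

```python
import torch
import torch.nn.functional as F

def test_time_adapt(model, aux_head, image, steps=1, lr=1e-4):
    """model: segmentation net; aux_head: reconstruction head taking the segmentation output as input."""
    params = list(model.parameters()) + list(aux_head.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        seg = model(image)                 # main-task output
        recon = aux_head(seg)              # auxiliary task built on top of the main-task output
        loss = F.l1_loss(recon, image)     # self-supervision: reconstruct the input
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return model(image)                # adapted prediction for this image
```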


Subject(s)
Liver Neoplasms; Neural Networks, Computer; Abdomen; Humans; Image Processing, Computer-Assisted/methods; Liver Neoplasms/diagnostic imaging; Tomography, X-Ray Computed/methods
19.
IEEE Trans Image Process ; 31: 2352-2364, 2022.
Article in English | MEDLINE | ID: mdl-35235507

ABSTRACT

Visible-infrared person re-identification (VI-ReID) is a cross-modality retrieval problem, which aims at matching the same pedestrian between the visible and infrared cameras. Due to the existence of pose variation, occlusion, and huge visual differences between the two modalities, previous studies mainly focus on learning image-level shared features. Since they usually learn a global representation or extract uniformly divided part features, these methods are sensitive to misalignments. In this paper, we propose a structure-aware positional transformer (SPOT) network to learn semantic-aware sharable modality features by utilizing the structural and positional information. It consists of two main components: attended structure representation (ASR) and transformer-based part interaction (TPI). Specifically, ASR models the modality-invariant structure feature for each modality and dynamically selects the discriminative appearance regions under the guidance of the structure information. TPI mines the part-level appearance and position relations with a transformer to learn discriminative part-level modality features. With a weighted combination of ASR and TPI, the proposed SPOT explores the rich contextual and structural information, effectively reducing cross-modality difference and enhancing the robustness against misalignments. Extensive experiments indicate that SPOT is superior to the state-of-the-art methods on two cross-modal datasets. Notably, the Rank-1/mAP value on the SYSU-MM01 dataset has improved by 8.43%/6.80%.


Subject(s)
Pedestrians; Semantics; Humans
20.
IEEE Trans Pattern Anal Mach Intell ; 44(2): 924-939, 2022 Feb.
Article in English | MEDLINE | ID: mdl-32750841

ABSTRACT

Deep embedding learning plays a key role in learning discriminative feature representations, where visually similar samples are pulled closer and dissimilar samples are pushed apart in the low-dimensional embedding space. This paper studies the unsupervised embedding learning problem by learning such a representation without using any category labels. This task faces two primary challenges: mining reliable positive supervision from highly similar fine-grained classes, and generalizing to unseen testing categories. To approximate the positive concentration and negative separation properties of category-wise supervised learning, we introduce a data augmentation invariant and instance spreading feature using instance-wise supervision. We also design two novel domain-agnostic augmentation strategies to further extend the supervision in feature space, which simulates large-batch training using a small batch size and the augmented features. To learn such a representation, we propose a novel instance-wise softmax embedding, which directly performs the optimization over the augmented instance features with binary discrimination softmax encoding. It significantly accelerates the learning speed with much higher accuracy than existing methods, under both seen and unseen testing categories. The unsupervised embedding performs well on samples from fine-grained categories even without a pre-trained network. We also develop a variant using category-wise supervision, namely category-wise softmax embedding, which achieves competitive performance against the state of the art without using any auxiliary information or restrictive sample mining.
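A minimal sketch of an instance-wise softmax embedding loss, assuming each image in a batch acts as its own class and its augmented view must be assigned back to it while being spread away from the other instances; the original binary-discrimination formulation and the feature-space augmentation strategies are simplified away.

```python
import torch
import torch.nn.functional as F

def instance_softmax_loss(feat_orig, feat_aug, temperature=0.1):
    """feat_orig, feat_aug: (B, D) L2-normalised embeddings of a batch and its augmented views."""
    logits = feat_aug @ feat_orig.t() / temperature                     # (B, B) similarities to all instances
    targets = torch.arange(feat_orig.size(0), device=feat_orig.device)  # view i belongs to instance i
    return F.cross_entropy(logits, targets)
```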


Subject(s)
Algorithms; Attention