Results 1 - 20 of 80
1.
IEEE Trans Image Process ; 33: 4796-4810, 2024.
Article in English | MEDLINE | ID: mdl-39186414

ABSTRACT

Balancing the trade-off between accuracy and speed, i.e., obtaining higher performance without sacrificing inference time, is a challenging topic in object detection. Knowledge distillation, a model compression technique, provides a potential and feasible way to handle this efficiency-effectiveness issue by transferring dark knowledge from a sophisticated teacher detector to a simple student one. Despite offering promising ways to harmonize accuracy and speed, current knowledge distillation methods for object detection still suffer from two limitations. First, most methods are inherited or borrowed from frameworks designed for image classification and adopt an implicit manner of imitating or constraining the features from intermediate layers or the output predictions between the teacher and student models, while paying little attention to the intrinsic relevance of the classification and localization predictions in object detection. Second, these methods fail to investigate the relationship between the detection and distillation tasks in the knowledge distillation pipeline, and they train the whole network by simply combining the losses of these two different tasks with hand-crafted weighting parameters. To address these issues, we propose a novel Relation Knowledge Distillation by Auxiliary Learning for Object Detection (ReAL) method in this paper. Specifically, we first design a prediction relation distillation module that makes the student model directly mimic the output predictions of the teacher, and construct self- and mutual-relation distillation losses to excavate the relation information between the teacher and student models. Moreover, to better delve into the relationship between the different tasks in the distillation pipeline, we introduce auxiliary learning into knowledge distillation for object detection and develop a dynamic weight adaptation strategy. By regarding detection as the primary task and distillation as the auxiliary task in the auxiliary learning framework, we dynamically adjust and regularize the corresponding loss weights of these tasks during training. Experiments on the MS COCO dataset are conducted using various teacher-student detector combinations, and the results show that our proposed ReAL achieves clear improvements on different distillation model configurations while performing favorably against the state of the art.
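
As an illustration of the auxiliary-learning idea, the sketch below weights a primary detection loss and an auxiliary distillation loss with learnable uncertainty-style weights; this is one common parameterization of dynamic task weighting, not necessarily the exact strategy used by ReAL.

```python
import torch
import torch.nn as nn

class AuxiliaryTaskWeighting(nn.Module):
    """Weight a primary detection loss and an auxiliary distillation loss with
    learnable uncertainty-style weights (a common choice for dynamic task
    weighting; not necessarily the scheme used by ReAL)."""

    def __init__(self):
        super().__init__()
        self.log_var_det = nn.Parameter(torch.zeros(()))
        self.log_var_kd = nn.Parameter(torch.zeros(()))

    def forward(self, det_loss, kd_loss):
        # exp(-log_var) acts as the task weight; adding log_var regularizes it.
        return (torch.exp(-self.log_var_det) * det_loss + self.log_var_det
                + torch.exp(-self.log_var_kd) * kd_loss + self.log_var_kd)
```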

2.
Neural Netw ; 174: 106218, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38518709

ABSTRACT

In image watermark removal, popular methods depend on given reference watermark-free images and remove watermarks in a supervised way. However, reference watermark-free images are difficult to obtain in the real world, and captured images often suffer from noise introduced by digital devices. To resolve these issues, we present a self-supervised network for image denoising and watermark removal (SSNet). SSNet uses a parallel network trained in a self-supervised manner to remove noise and watermarks. Specifically, each sub-network contains two sub-blocks. The upper sub-network uses its first sub-block to remove noise, following the noise-to-noise principle; its second sub-block then removes watermarks according to the distributions of watermarks. To prevent the loss of important information, the lower sub-network simultaneously learns noise and watermarks in a self-supervised manner. Moreover, the two sub-networks interact via attention to extract more complementary salient information. The proposed method does not depend on paired images to learn a blind denoising and watermark removal model, which is very meaningful for real applications. It is also more effective than popular image watermark removal methods on public datasets. Code can be found at https://github.com/hellloxiaotian/SSNet.
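
A minimal sketch of the noise-to-noise principle mentioned above: the denoiser is supervised by a second, independently noised copy of the same image, so no clean target is ever shown to the network (noise is simulated here purely for illustration; SSNet's full pipeline additionally handles watermarks and uses a parallel architecture).

```python
import torch
import torch.nn.functional as F

def noise2noise_step(denoiser, images, noise_std=0.1):
    """One noise-to-noise training step: input and target are two independently
    noised copies of the same image, so the network never sees a clean reference
    (illustrative sketch only)."""
    noisy_in  = images + noise_std * torch.randn_like(images)
    noisy_tgt = images + noise_std * torch.randn_like(images)
    return F.mse_loss(denoiser(noisy_in), noisy_tgt)
```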

3.
Article in English | MEDLINE | ID: mdl-38507386

ABSTRACT

In this paper, we consider two challenging issues in reference-based super-resolution (RefSR) for smartphones: (i) how to choose a proper reference image, and (ii) how to learn RefSR in a self-supervised manner. In particular, we propose a novel self-supervised learning approach for real-world RefSR from observations at dual and multiple camera zooms. First, considering the popularity of multiple cameras in modern smartphones, the more zoomed (telephoto) image can naturally be leveraged as the reference to guide the super-resolution (SR) of the less zoomed (ultra-wide) image, which gives us a chance to learn a deep network that performs SR from the dual zoomed observations (DZSR). Second, for self-supervised learning of DZSR, we take the telephoto image instead of an additional high-resolution image as the supervision, and select a center patch from it as the reference to super-resolve the corresponding ultra-wide image patch. To mitigate the misalignment between the ultra-wide low-resolution (LR) patch and the telephoto ground-truth (GT) image during training, we first adopt patch-based optical flow alignment to obtain the warped LR, and then design an auxiliary LR to guide the deformation of the warped LR features. To generate visually pleasing results, we present a local overlapped sliced Wasserstein loss to better represent the perceptual difference between GT and output in the feature space. During testing, DZSR can be directly deployed to super-resolve the whole ultra-wide image with the telephoto image as reference. In addition, we further take multiple zoomed observations to explore self-supervised RefSR and present a progressive fusion scheme for the effective utilization of reference images. Experiments show that our methods achieve better quantitative and qualitative performance than the state of the art. The code and pre-trained models will be publicly available.
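
For the sliced Wasserstein idea, here is a minimal global (non-overlapped) sketch; the paper's local overlapped variant applies the same computation within overlapping feature patches.

```python
import torch

def sliced_wasserstein(feat_x, feat_y, n_proj=64):
    """Sliced Wasserstein distance between two feature maps of shape (C, H, W).
    Each spatial location is treated as a C-dimensional sample; the 'local
    overlapped' variant would apply this within overlapping patches."""
    x = feat_x.flatten(1).t()                          # (H*W, C)
    y = feat_y.flatten(1).t()
    proj = torch.randn(x.shape[1], n_proj, device=x.device)
    proj = proj / proj.norm(dim=0, keepdim=True)       # random unit directions
    x_proj = (x @ proj).sort(dim=0).values             # sort along samples
    y_proj = (y @ proj).sort(dim=0).values
    return (x_proj - y_proj).pow(2).mean()
```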

4.
IEEE Trans Image Process ; 33: 1977-1989, 2024.
Article in English | MEDLINE | ID: mdl-38451756

ABSTRACT

Recently, class incremental semantic segmentation (CISS) towards the practical open-world setting has attracted increasing research interest, and is mainly challenged by the well-known issue of catastrophic forgetting. In particular, knowledge distillation (KD) techniques have been widely studied to alleviate catastrophic forgetting. Despite their promising performance, existing KD-based methods generally use the same distillation scheme for different intermediate layers to transfer old knowledge, while employing manually tuned and fixed trade-off weights to control the effect of KD. These methods take no account of the feature characteristics of different intermediate layers, limiting the effectiveness of KD for CISS. In this paper, we propose a layer-specific knowledge distillation (LSKD) method to assign appropriate distillation schemes and weights to various intermediate layers by considering their feature characteristics, aiming to further explore the potential of KD in improving the performance of CISS. Specifically, we present a mask-guided distillation (MD) to alleviate the background shift on semantic features, which performs distillation by masking the features affected by the background. Furthermore, a mask-guided context distillation (MCD) is presented to explore the global context information lying in high-level semantic features. Based on them, our LSKD assigns different distillation schemes according to feature characteristics. To adjust the effect of layer-specific distillation adaptively, LSKD introduces a regularized gradient equilibrium method to learn dynamic trade-off weights. Additionally, our LSKD simultaneously learns the distillation schemes and trade-off weights of different layers by developing a bi-level optimization method. Extensive experiments on the widely used Pascal VOC 12 and ADE20K benchmarks show that our LSKD clearly outperforms its counterparts while achieving state-of-the-art results.
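
A simplified sketch of the mask-guided distillation idea: features affected by the background are masked out before computing the distillation loss (the exact masking and weighting details of LSKD may differ).

```python
import torch
import torch.nn.functional as F

def mask_guided_distill(student_feat, teacher_feat, fg_mask):
    """Distill only at locations not affected by the (shifting) background.
    student_feat, teacher_feat: (N, C, H, W); fg_mask: (N, 1, H, W) in {0, 1}."""
    mask = F.interpolate(fg_mask.float(), size=student_feat.shape[-2:], mode="nearest")
    diff = (student_feat - teacher_feat).pow(2) * mask     # masked L2 discrepancy
    return diff.sum() / (mask.sum() * student_feat.shape[1] + 1e-6)
```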

5.
IEEE Trans Image Process ; 33: 310-321, 2024.
Article in English | MEDLINE | ID: mdl-38090849

ABSTRACT

Image retouching, which aims to regenerate visually pleasing renditions of given images, is a subjective task in which users have different aesthetic preferences. Most existing methods adopt a deterministic model to learn the retouching style of a specific expert, making them less flexible in meeting diverse subjective preferences. Besides, the intrinsic diversity of an expert, arising from the targeted processing of different images, is also insufficiently described. To circumvent such issues, we propose to learn diverse image retouching with normalizing flow-based architectures. Unlike current flow-based methods which directly generate the output image, we argue that learning in a one-dimensional style space can 1) disentangle the retouching styles from the image content, 2) lead to a stable style presentation form, and 3) avoid spatial disharmony effects. To obtain meaningful image tone style representations, a joint training pipeline is carefully designed, composed of a style encoder, a conditional RetouchNet, and an image tone style normalizing flow (TSFlow) module. In particular, the style encoder predicts the target style representation of an input image, which serves as the conditional information for the RetouchNet to perform retouching, while the TSFlow maps the style representation vector into a Gaussian distribution in the forward pass. After training, the TSFlow can generate diverse image tone style vectors by sampling from the Gaussian distribution. Extensive experiments on the MIT-Adobe FiveK and PPR10K datasets show that our proposed method performs favorably against state-of-the-art methods and is effective in generating diverse results to satisfy different human aesthetic preferences. Source code and pre-trained models are publicly available at https://github.com/SSRHeart/TSFlow.


Subject(s)
Learning, Humans, Normal Distribution
6.
IEEE Trans Pattern Anal Mach Intell ; 45(12): 15802-15819, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37782579

ABSTRACT

Global covariance pooling (GCP), as an effective alternative to global average pooling, has shown a good capacity to improve deep convolutional neural networks (CNNs) in a variety of vision tasks. Despite this promising performance, how GCP (especially its post-normalization) works in deep learning is still an open problem. In this paper, we make an effort toward understanding the effect of GCP on deep learning from an optimization perspective. Specifically, we first analyze the behavior of GCP with matrix power normalization on the optimization loss and gradient computation of deep architectures. Our findings show that GCP can improve the Lipschitzness of the optimization loss and achieve flatter local minima, while improving gradient predictiveness and functioning as a special pre-conditioner on gradients. Then, we explore the effect of post-normalization on GCP from the model optimization perspective, which encourages us to propose a simple yet effective normalization, namely DropCov. Based on the above findings, we point out several merits of deep GCP that have not been recognized previously or fully explored, including faster convergence, stronger model robustness, and better generalization across tasks. Extensive experimental results using both CNNs and vision transformers on diversified vision tasks provide strong support for our findings while verifying the effectiveness of our method.
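
For reference, a standard recipe for GCP with matrix (square-root) power normalization uses a Newton-Schulz iteration over the channel covariance; the sketch below follows that common recipe and is not the paper's exact implementation.

```python
import torch

def gcp_matrix_power_norm(x, iters=5):
    """Global covariance pooling with approximate matrix square-root
    normalization via Newton-Schulz iteration.
    x: (N, C, H, W) -> returns (N, C, C), an approximation of cov^{1/2}."""
    n, c, h, w = x.shape
    feat = x.reshape(n, c, h * w)
    feat = feat - feat.mean(dim=2, keepdim=True)
    cov = feat @ feat.transpose(1, 2) / (h * w - 1)          # (N, C, C)

    # Pre-normalize by trace so the iteration converges, then iterate.
    eye = torch.eye(c, device=x.device).expand(n, c, c)
    tr = cov.diagonal(dim1=1, dim2=2).sum(dim=1).view(n, 1, 1)
    a = cov / tr
    y, z = a, eye
    for _ in range(iters):
        t = 0.5 * (3.0 * eye - z @ y)
        y, z = y @ t, t @ z
    return y * tr.sqrt()                                     # post-compensation
```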

7.
IEEE Trans Image Process ; 32: 5921-5932, 2023.
Article in English | MEDLINE | ID: mdl-37883292

ABSTRACT

Infrared small and dim (S&D) target detection is one of the key techniques in infrared search and tracking systems. Since local regions similar to infrared S&D targets spread over the whole background, exploring the correlation among image features over long-range dependencies to mine the difference between target and background is crucial for robust detection. However, existing deep learning-based methods are limited by the locality of convolutional neural networks, which impairs their ability to capture long-range dependencies. Additionally, the small and dim appearance of infrared targets makes the detection model prone to missed detections. To this end, we propose a robust and general infrared S&D target detection method based on the transformer. We adopt the self-attention mechanism of the transformer to learn the correlation of image features over a larger range. Moreover, we design a feature enhancement module to learn discriminative features of S&D targets and avoid missed detections. After that, to avoid the loss of target information, we adopt a decoder with U-Net-like skip connections to retain more information about S&D targets. Finally, we obtain the detection result with a segmentation head. Extensive experiments on two public datasets show the clear superiority of the proposed method over state-of-the-art methods, as well as its stronger generalization ability and better noise tolerance.

8.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 10070-10083, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37027640

ABSTRACT

Previous knowledge distillation (KD) methods for object detection mostly focus on feature imitation instead of mimicking the prediction logits, due to the latter's inefficiency in distilling localization information. In this paper, we investigate whether logit mimicking always lags behind feature imitation. Towards this goal, we first present a novel localization distillation (LD) method which can efficiently transfer localization knowledge from the teacher to the student. Second, we introduce the concept of a valuable localization region that can help selectively distill the classification and localization knowledge for a certain region. Combining these two new components, for the first time, we show that logit mimicking can outperform feature imitation, and that the absence of localization distillation is a critical reason why logit mimicking has under-performed for years. Thorough studies exhibit the great potential of logit mimicking, which can significantly alleviate localization ambiguity, learn robust feature representations, and ease the training difficulty in the early stage. We also provide a theoretical connection between the proposed LD and classification KD, showing that they share the equivalent optimization effect. Our distillation scheme is simple as well as effective and can be easily applied to both dense horizontal object detectors and rotated object detectors. Extensive experiments on the MS COCO, PASCAL VOC, and DOTA benchmarks demonstrate that our method can achieve considerable AP improvement without any sacrifice in inference speed. Our source code and pretrained models are publicly available at https://github.com/HikariTJU/LD.
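
A rough sketch of how localization knowledge can be distilled when box edges are predicted as discrete distributions (a GFL-style regression head is assumed here): temperature-scaled KL divergence transfers the teacher's edge distributions to the student.

```python
import torch
import torch.nn.functional as F

def localization_distillation(student_reg, teacher_reg, temperature=10.0):
    """KL-based localization distillation on discretized box-edge logits.
    student_reg, teacher_reg: (num_boxes, 4, n_bins) logits, where each edge of
    a box is predicted as a distribution over n_bins (GFL-style head assumed)."""
    t = temperature
    p_teacher = F.softmax(teacher_reg / t, dim=-1)
    log_p_student = F.log_softmax(student_reg / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```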


Subject(s)
Algorithms, Benchmarking, Humans, Learning, Software
9.
Article in English | MEDLINE | ID: mdl-37021850

ABSTRACT

Cross-domain face translation aims to transfer face images from one domain to another. It can be widely used in practical applications, such as photos/sketches in law enforcement, photos/drawings in digital entertainment, and near-infrared (NIR)/visible (VIS) images in security access control. Restricted by limited cross-domain face image pairs, the existing methods usually yield structural deformation or identity ambiguity, which leads to poor perceptual appearance. To address this challenge, we propose a multi-view knowledge (structural knowledge and identity knowledge) ensemble framework with frequency consistency (MvKE-FC) for cross-domain face translation. Due to the structural consistency of facial components, the multi-view knowledge learned from large-scale data can be appropriately transferred to limited cross-domain image pairs and significantly improve the generative performance. To better fuse multi-view knowledge, we further design an attention-based knowledge aggregation module that integrates useful information, and we also develop a frequency-consistent (FC) loss that constrains the generated images in the frequency domain. The designed FC loss consists of a multidirection Prewitt (mPrewitt) loss for high-frequency consistency and a Gaussian blur loss for low-frequency consistency. Furthermore, our FC loss can be flexibly applied to other generative models to enhance their overall performance. Extensive experiments on multiple cross-domain face datasets demonstrate the superiority of our method over state-of-the-art methods both qualitatively and quantitatively.
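
A minimal sketch of the two components of such a frequency-consistent loss: a Prewitt-gradient term for high-frequency consistency and a Gaussian-blur term for low-frequency consistency (two Prewitt directions and grayscale inputs are assumed; the paper's multidirection mPrewitt variant uses more directions).

```python
import torch
import torch.nn.functional as F

def prewitt_loss(fake, real):
    """High-frequency consistency via horizontal/vertical Prewitt gradients.
    fake, real: (N, 1, H, W) grayscale images."""
    kx = torch.tensor([[-1., 0., 1.], [-1., 0., 1.], [-1., 0., 1.]])
    ky = kx.t()
    kernels = torch.stack([kx, ky]).unsqueeze(1).to(fake.device)   # (2, 1, 3, 3)
    return F.l1_loss(F.conv2d(fake, kernels, padding=1),
                     F.conv2d(real, kernels, padding=1))

def gaussian_blur_loss(fake, real, kernel_size=5, sigma=1.5):
    """Low-frequency consistency by comparing Gaussian-blurred images."""
    coords = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = (g / g.sum()).to(fake.device)
    k2d = (g[:, None] * g[None, :]).view(1, 1, kernel_size, kernel_size)
    blur = lambda x: F.conv2d(x, k2d, padding=kernel_size // 2)
    return F.l1_loss(blur(fake), blur(real))
```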

10.
IEEE Trans Neural Netw Learn Syst ; 34(3): 1132-1145, 2023 Mar.
Article in English | MEDLINE | ID: mdl-34428157

ABSTRACT

The entropy of the codes usually serves as the rate loss in recent learned lossy image compression methods. Precise estimation of the probability distribution of the codes plays a vital role in reducing the entropy and boosting the joint rate-distortion performance. However, existing deep learning-based entropy models generally assume the latent codes are statistically independent or depend on some side information or local context, which fails to take the global similarity within the context into account and thus hinders accurate entropy estimation. To address this issue, we propose a special nonlocal operation for context modeling that employs the global similarity within the context. Specifically, due to the constraint of the context, the nonlocal operation cannot be computed directly in context modeling. We therefore exploit the relationship between the code maps produced by deep neural networks and introduce proxy similarity functions as a workaround. Then, we combine the local and global context via a nonlocal attention block and employ it in masked convolutional networks for entropy modeling. Considering that the width of the transforms is essential for training low-distortion models, we finally introduce a U-Net block in the transforms to increase the width with manageable memory consumption and time complexity. Experiments on the Kodak and Tecnick datasets demonstrate the superiority of the proposed context-based nonlocal attention block for entropy modeling and of the U-Net block in low-distortion settings. On the whole, our model performs favorably against existing image compression standards and recent deep image compression models.

11.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 2726-2735, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35786551

ABSTRACT

This article substantially extends our work published at ECCV (Li et al., 2020), in which an intermediate-level attack was proposed to improve the transferability of baseline adversarial examples. Specifically, we advocate a framework in which a direct linear mapping from the intermediate-level discrepancies (between adversarial features and benign features) to the prediction loss of the adversarial example is established. By delving into the core components of such a framework, we show that a variety of linear regression models can be considered to establish the mapping, that the magnitude of the finally obtained intermediate-level adversarial discrepancy is correlated with the transferability, and that further boosts in performance can be achieved by performing multiple runs of the baseline attack with random initialization. In addition, by leveraging these findings, we achieve new state-of-the-art results on transfer-based l∞ and l2 attacks. Our code is publicly available at https://github.com/qizhangli/ila-plus-plus-lr.

12.
IEEE Trans Pattern Anal Mach Intell ; 45(5): 5904-5917, 2023 May.
Article in English | MEDLINE | ID: mdl-36251909

ABSTRACT

Blind face restoration is a challenging task due to unknown, unsynthesizable, and complex degradation, yet it is valuable in many practical applications. To improve the performance of blind face restoration, recent works mainly treat the two aspects, i.e., generic and specific restoration, separately. In particular, generic restoration attempts to restore results through general facial structure priors, but, on the one hand, it cannot generalize to real-world degraded observations due to the limited capability of direct CNN mappings in learning blind restoration, and on the other hand, it fails to exploit identity-specific details. On the contrary, specific restoration aims to incorporate identity features from a reference of the same identity, where the requirement of a proper reference severely limits the application scenarios. In general, it is a challenging and intractable task to improve the photo-realistic performance of blind restoration and adaptively handle the generic and specific restoration scenarios with a single unified model. Instead of implicitly learning the mapping from a low-quality image to its high-quality counterpart, this paper suggests a DMDNet that explicitly memorizes the generic and specific features through dual dictionaries. First, the generic dictionary learns general facial priors from high-quality images of any identity, while the specific dictionary stores the identity-belonging features of each person individually. Second, to handle degraded inputs with or without a specific reference, a dictionary transform module is suggested to read the relevant details from the dual dictionaries, which are subsequently fused into the input features. Finally, multi-scale dictionaries are leveraged to benefit coarse-to-fine restoration. The whole framework, including the generic and specific dictionaries, is optimized in an end-to-end manner and can be flexibly plugged into different application scenarios. Moreover, a new high-quality dataset, termed CelebRef-HQ, is constructed to promote the exploration of specific face restoration in the high-resolution space. Experimental results demonstrate that the proposed DMDNet performs favorably against the state of the art in both quantitative and qualitative evaluation, and generates more photo-realistic results on real-world low-quality images. The codes, models, and the CelebRef-HQ dataset will be publicly available at https://github.com/csxmli2016/DMDNet.

13.
IEEE Trans Image Process ; 31: 6664-6678, 2022.
Article in English | MEDLINE | ID: mdl-36260596

ABSTRACT

Multimodal image synthesis has emerged as a viable solution to the modality-missing challenge. Most existing approaches employ softmax-based classifiers to provide modal constraints for the generative models. These methods, however, focus on learning to distinguish inter-domain differences while failing to build intra-domain compactness, resulting in inferior synthetic results. To provide sufficient domain-specific constraints, we introduce a novel prototype discriminator for generative adversarial networks (PT-GAN) to effectively estimate the missing or noisy modalities. Different from most previous works, we introduce a Radial Basis Function (RBF) network, endowing the discriminator with domain-specific prototypes, to improve the optimization of the generative model. Since the prototype learning extracts a more discriminative representation of each domain and emphasizes intra-domain compactness, it reduces the sensitivity of the discriminator to pixel changes in generated images. To address this dilemma, we further propose a reconstructive regularization term which connects the discriminator with the generator, thus enhancing its pixel detectability. To this end, the proposed PT-GAN provides not only consistent domain-specific constraints but also a reasonable uncertainty estimation of generated images via the RBF distance. Experimental results show that our method outperforms state-of-the-art techniques. The source code will be available at: https://github.com/zhiweibi/PT-GAN.
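
An illustrative sketch of a prototype-based discriminator head: features are scored by their RBF similarity to learnable per-domain prototypes (the surrounding GAN losses and the reconstructive regularization are omitted).

```python
import torch
import torch.nn as nn

class RBFPrototypeHead(nn.Module):
    """Score features by RBF similarity to learnable per-domain prototypes
    (a simplified sketch of the prototype idea, not the full PT-GAN head)."""

    def __init__(self, feat_dim, num_domains, gamma=1.0):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_domains, feat_dim))
        self.gamma = gamma

    def forward(self, feat):                              # feat: (N, feat_dim)
        # Squared Euclidean distance to every prototype, then RBF response.
        d2 = torch.cdist(feat, self.prototypes).pow(2)    # (N, num_domains)
        return torch.exp(-self.gamma * d2)
```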

14.
Article in English | MEDLINE | ID: mdl-36227812

ABSTRACT

Convolutional neural networks (CNNs) have obtained remarkable performance via deep architectures. However, these CNNs often achieve poor robustness for image super-resolution (SR) under complex scenes. In this article, we present a heterogeneous group SR CNN (HGSRCNN) that leverages structure information of different types to obtain a high-quality image. Specifically, each heterogeneous group block (HGB) of HGSRCNN uses a heterogeneous architecture containing a symmetric group convolutional block and a complementary convolutional block, in a parallel way, to enhance the internal and external relations of different channels and facilitate richer low-frequency structure information of different types. To prevent redundant features, a refinement block (RB) with signal enhancements, in a serial way, is designed to filter out useless information. To prevent the loss of original information, a multilevel enhancement mechanism guides the CNN toward a symmetric architecture, promoting the expressive ability of HGSRCNN. Besides, a parallel up-sampling mechanism is developed to train a blind SR model. Extensive experiments illustrate that the proposed HGSRCNN achieves excellent SR performance in terms of both quantitative and qualitative analysis. Codes can be accessed at https://github.com/hellloxiaotian/HGSRCNN.

15.
Neural Netw ; 153: 373-385, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35779445

ABSTRACT

CNNs with strong learning abilities are widely chosen to resolve the super-resolution problem. However, CNNs depend on deeper network architectures to improve the performance of image super-resolution, which generally increases computational cost. In this paper, we present an enhanced super-resolution group CNN (ESRGCNN) with a shallow architecture that fully fuses deep and wide channel features to extract more accurate low-frequency information in terms of the correlations of different channels in single image super-resolution (SISR). A signal enhancement operation in the ESRGCNN is also useful for inheriting more long-distance contextual information to resolve long-term dependencies. An adaptive up-sampling operation is incorporated into the CNN to obtain an image super-resolution model that handles low-resolution images of different sizes. Extensive experiments show that our ESRGCNN surpasses the state of the art in terms of SISR performance, complexity, execution speed, image quality evaluation, and visual effect. Code is available at https://github.com/hellloxiaotian/ESRGCNN.


Subject(s)
Image Processing, Computer-Assisted; Neural Networks, Computer; Image Processing, Computer-Assisted/methods
16.
Sci Rep ; 12(1): 12229, 2022 07 18.
Article in English | MEDLINE | ID: mdl-35851829

ABSTRACT

Crowd counting is considered a challenging issue in computer vision. One of the most critical challenges in crowd counting is handling scale variations. Compared with other methods, CNN-based methods achieve better performance. However, given the limitation of fixed geometric structures, head-scale features are not fully captured. Deformable convolution with additional offsets is widely used in image classification and pattern recognition, as it can successfully exploit the potential of spatial information. However, owing to the randomly initialized offset parameters at network initialization, the sampling points of the deformable convolution are disorderly stacked, weakening the effectiveness of feature extraction. To handle the invalid learning of offsets and the inefficient utilization of deformable convolution, an offset-decoupled deformable convolution (ODConv) is proposed in this paper. It can fully capture information within the effective region of the sampling points, leading to better performance. In extensive experiments, average MAEs of 62.3, 8.3, 91.9, and 159.3 are achieved with our method on the ShanghaiTech A, ShanghaiTech B, UCF-QNRF, and UCF_CC_50 datasets, respectively, outperforming state-of-the-art methods and validating the effectiveness of the proposed ODConv.
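
As background, a plain deformable convolution with offsets predicted from the input looks as follows (standard torchvision usage with zero-initialized offsets; ODConv's offset decoupling itself is not reproduced here).

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """Deformable convolution with offsets predicted from the input."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # 2 offsets (dy, dx) per kernel position.
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset_conv.weight)   # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)

    def forward(self, x):
        offsets = self.offset_conv(x)
        return self.deform_conv(x, offsets)
```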


Subject(s)
Algorithms
17.
IEEE Trans Image Process ; 31: 3541-3552, 2022.
Article in English | MEDLINE | ID: mdl-35544507

ABSTRACT

With the progress of AI-based facial forgery (i.e., deepfakes), people are concerned about its abuse. Although efforts have been made to train models to recognize such forgeries, existing models suffer from poor generalization to unseen forgery technologies and high sensitivity to changes in image/video quality. In this paper, we advocate robust training for improving generalization ability. We believe that training with samples adversarially crafted to attack the classification models improves generalization considerably. Considering that AI-based face manipulation often leads to high-frequency artifacts that can be easily spotted by models yet are difficult to generalize, we further propose a new adversarial training method that attempts to blur out these artifacts by introducing pixel-wise Gaussian blurring. Extensive empirical evidence shows that, with adversarial training, models are forced to learn more discriminative and generalizable features. Our code: https://github.com/ah651/deepfake_adv.
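
A rough sketch of blur-based adversarial sample crafting: per-pixel blend weights between the original image and a Gaussian-blurred copy are optimized to increase the classifier's loss. This is an assumed formulation for illustration; the paper's exact attack and training loop may differ.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def blur_adversarial_example(model, images, labels, steps=5, step_size=0.1):
    """Interpolate each pixel between the image and a blurred copy, with the
    per-pixel blend weights optimized adversarially (PGD-style sign ascent)."""
    blurred = gaussian_blur(images, kernel_size=[5, 5], sigma=[1.5, 1.5])
    alpha = torch.zeros_like(images[:, :1], requires_grad=True)      # (N, 1, H, W)
    for _ in range(steps):
        mix = images + torch.sigmoid(alpha) * (blurred - images)
        loss = F.cross_entropy(model(mix), labels)
        grad, = torch.autograd.grad(loss, alpha)
        alpha = (alpha + step_size * grad.sign()).detach().requires_grad_(True)
    return (images + torch.sigmoid(alpha) * (blurred - images)).detach()
```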

18.
IEEE Trans Image Process ; 31: 2962-2974, 2022.
Article in English | MEDLINE | ID: mdl-35353700

ABSTRACT

Object detection is usually solved by learning a deep architecture involving classification and localization tasks, where feature learning for these two tasks is shared using the same backbone model. Recent works have shown that suitable disentanglement of the classification and localization tasks has great potential to improve the performance of object detection. Despite this promising performance, existing feature disentanglement methods usually suffer from two limitations. First, most of them only focus on disentangling the proposals or prediction heads for classification and localization after the RPN, while little consideration has been given to the fact that the features for these two different tasks are actually obtained by a shared backbone model before the RPN. Second, they are designed for two-stage detectors and are not applicable to one-stage methods. To overcome these limitations, this paper presents a novel fully task-specific feature learning method for one-stage object detection. Specifically, our method first learns disentangled features for the classification and localization tasks using two separate backbone models, where auxiliary classification and localization heads are inserted at the end of the two backbones to provide fully task-specific features for classification and localization. Then, a feature interaction module is developed for aligning and fusing the task-specific features, which are further used to produce the final detection result. Experiments on MS COCO show that our proposed method (dubbed CrabNet) achieves clear improvements over its counterparts with only a limited increase in inference time, while performing favorably against the state of the art.

19.
IEEE Trans Image Process ; 31: 1406-1417, 2022.
Article in English | MEDLINE | ID: mdl-35038294

ABSTRACT

Weakly supervised semantic segmentation (WSSS) based on bounding box annotations has attracted considerable recent attention and has achieved promising performance. However, most existing methods focus on generating high-quality pseudo labels for segmented objects using box indicators, but fail to fully explore and exploit the priors in bounding box annotations, which limits the performance of WSSS methods, especially for fine parts and boundaries. To overcome these issues, this paper proposes a novel Pixel-as-Instance Prior (PIP) for WSSS methods by delving deeper into the pixel priors in bounding box annotations. Specifically, the proposed PIP is built on two important observations about pixels around bounding boxes. First, since objects are usually irregular in shape and tightly enclosed by their bounding boxes (dubbed the irregular-filling prior), each row or column of a bounding box generally contains at least one pixel belonging to the foreground object and at least one belonging to the background. Second, pixels near the bounding boxes tend to be highly ambiguous and more difficult to classify (dubbed the label-ambiguity prior). To implement our PIP, a constrained loss akin to multiple instance learning (MIL) and a labeling-balance loss are developed to jointly train WSSS models, which regard each pixel as a weighted positive or negative instance while exploiting the above priors (i.e., the irregular-filling and label-ambiguity priors) from bounding box annotations in an efficient way. Note that our PIP can be flexibly integrated with various WSSS methods, clearly improving their performance with negligible computational overhead in the training stage. The experiments are conducted on the widely used PASCAL VOC 2012 and Cityscapes benchmarks, and the results show that our PIP has a good ability to improve the performance of various WSSS methods while achieving very competitive results.
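
A simplified sketch of the irregular-filling prior as a MIL-style constraint: within a box, the most confident pixel in every row and column is pushed toward the foreground (the label-ambiguity weighting and the labeling-balance loss are omitted).

```python
import torch
import torch.nn.functional as F

def box_mil_loss(fg_prob, box):
    """MIL-style constraint from a single bounding box: every row and column
    inside the box should contain at least one foreground pixel.
    fg_prob: (H, W) foreground probabilities; box: (x1, y1, x2, y2) ints."""
    x1, y1, x2, y2 = box
    crop = fg_prob[y1:y2, x1:x2]
    row_max = crop.max(dim=1).values          # best pixel per row
    col_max = crop.max(dim=0).values          # best pixel per column
    bag_scores = torch.cat([row_max, col_max])
    return F.binary_cross_entropy(bag_scores, torch.ones_like(bag_scores))
```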

20.
Article in English | MEDLINE | ID: mdl-37015433

ABSTRACT

Occluded person re-identification (ReID) is a challenging task due to heavier background noise and incomplete foreground information. Although existing human parsing-based ReID methods can tackle this problem with semantic alignment at the finest pixel level, their performance is heavily affected by the human parsing model. Most supervised methods train an extra human parsing model alongside the ReID model using cross-domain human part annotations, suffering from expensive annotation costs and the domain gap; unsupervised methods integrate a feature clustering-based human parsing process into the ReID model, but the lack of supervision signals leads to less satisfactory segmentation results. In this paper, we argue that the pre-existing information in the ReID training dataset can be directly used as supervision to train the human parsing model without any extra annotation. By integrating a weakly supervised human co-parsing network into the ReID network, we propose a novel framework that exploits information shared across different images of the same pedestrian, called the Human Co-parsing Guided Alignment (HCGA) framework. Specifically, the human co-parsing network is weakly supervised by three consistency criteria, namely global semantics, local space, and background. By feeding the semantic information and deep features from the person ReID network into the guided alignment module, features of the foreground and human parts can be obtained for effective occluded person ReID. Experimental results on two occluded and two holistic datasets demonstrate the superiority of our method. In particular, on Occluded-DukeMTMC it achieves 70.2% Rank-1 accuracy and 57.5% mAP.
