Results 1 - 20 of 31
1.
IEEE Trans Image Process ; 32: 5823-5836, 2023.
Article in English | MEDLINE | ID: mdl-37847622

ABSTRACT

Existing deep learning-based shadow removal methods still produce images with shadow remnants. These shadow remnants typically exist in homogeneous regions with low-intensity values, making them untraceable in the existing image-to-image mapping paradigm. We observe that shadows mainly degrade images at the image-structure level (at which humans perceive object shapes and continuous colors). Hence, in this paper, we propose to remove shadows at the image-structure level. Based on this idea, we propose a novel structure-informed shadow removal network (StructNet) that leverages image-structure information to address the shadow remnant problem. Specifically, StructNet first reconstructs the structure information of the input image without shadows and then uses the restored shadow-free structure prior to guide the image-level shadow removal. StructNet contains two main novel modules: 1) a mask-guided shadow-free extraction (MSFE) module to extract image structural features in a non-shadow-to-shadow directional manner; and 2) a multi-scale feature & residual aggregation (MFRA) module to leverage the shadow-free structure information to regularize feature consistency. In addition, we extend StructNet to exploit multi-level structure information (MStructNet), which further boosts shadow removal performance with minimal computational overhead. Extensive experiments on three shadow removal benchmarks demonstrate that our method outperforms existing shadow removal methods, and that StructNet can be integrated with existing methods to improve them further.
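To make the structure-then-image idea above concrete, here is a minimal sketch (not the authors' code; module names, channel sizes, and the tiny backbone are illustrative assumptions) of a two-stage model: a first network restores a shadow-free structure map from the image and shadow mask, and a second network removes shadows at the image level guided by that structure prior.

```python
# Illustrative two-stage pipeline: restore structure first, then remove shadows
# conditioned on it. Hypothetical modules, not the published StructNet.
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class StructureThenImageRemoval(nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 1: predict a shadow-free structure map from image + shadow mask.
        self.structure_net = nn.Sequential(TinyBlock(4, 32), nn.Conv2d(32, 1, 1))
        # Stage 2: image-level shadow removal conditioned on the structure prior.
        self.removal_net = nn.Sequential(TinyBlock(5, 32), nn.Conv2d(32, 3, 1))

    def forward(self, image, shadow_mask):
        structure = self.structure_net(torch.cat([image, shadow_mask], dim=1))
        restored = self.removal_net(torch.cat([image, shadow_mask, structure], dim=1))
        return restored, structure

model = StructureThenImageRemoval()
img, mask = torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64)
out, struct_map = model(img, mask)
print(out.shape, struct_map.shape)  # (1, 3, 64, 64) (1, 1, 64, 64)
```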

2.
Article in English | MEDLINE | ID: mdl-37018668

ABSTRACT

Existing deraining methods mainly focus on a single input image. However, with just a single input image, it is extremely difficult to accurately detect and remove rain streaks in order to restore a rain-free image. In contrast, a light field image (LFI) embeds abundant 3D structure and texture information of the target scene by recording the direction and position of each incident ray via a plenoptic camera, which has emerged as a popular device in the computer vision and graphics research communities. However, making full use of the abundant information available from LFIs, such as the 2D array of sub-views and the disparity map of each sub-view, for effective rain removal is still a challenging problem. In this paper, we propose a novel network, 4D-MGP-SRRNet, for rain streak removal from LFIs. Our method takes as input all sub-views of a rainy LFI. To make full use of the LFI, we adopt 4D convolutional layers to build the proposed rain streak removal network, which processes all sub-views of the LFI simultaneously. In the proposed network, a rain detection model, MGPDNet, with a novel Multi-scale Self-guided Gaussian Process (MSGP) module is proposed to detect high-resolution rain streaks from all sub-views of the input LFI at multiple scales. Semi-supervised learning is introduced for MSGP to accurately detect rain streaks by training on both virtual-world and real-world rainy LFIs at multiple scales, with pseudo ground truths calculated for real-world rain streaks. We then feed all sub-views, with the predicted rain streaks subtracted, into a 4D convolution-based Depth Estimation Residual Network (DERNet) to estimate depth maps, which are later converted into fog maps. Finally, all sub-views, concatenated with the corresponding rain streaks and fog maps, are fed into a powerful rainy-LFI restoration model based on an adversarial recurrent neural network to progressively eliminate rain streaks and recover the rain-free LFI. Extensive quantitative and qualitative evaluations conducted on both synthetic and real-world LFIs demonstrate the effectiveness of the proposed method.

3.
IEEE Trans Vis Comput Graph ; 29(9): 3922-3936, 2023 Sep.
Article in English | MEDLINE | ID: mdl-35503828

ABSTRACT

With the goal of making content easy to understand, memorize, and share, a clear and easy-to-follow layout is important for visual notes. Unfortunately, since visual notes are often taken by designers in real time while watching a video or listening to a presentation, the contents are usually not carefully structured, resulting in layouts that may be difficult for others to follow. In this article, we address this problem by proposing a novel approach that automatically optimizes the layouts of visual notes. Our approach predicts the design order of a visual note and then warps the contents along the predicted design order so that the visual note becomes easier to follow and understand. At the core of our approach is a learning-based framework that reasons about the element-wise design orders of visual notes. In particular, we first propose a hierarchical LSTM-based architecture to predict a grid-based design order of the visual note, based on its graphical and textual information. We then derive the element-wise order from the grid-based prediction. This design allows our network to be weakly supervised, i.e., it can predict dense grid-based orders from visual notes with only coarse annotations. We evaluate the effectiveness of our approach on visual notes with diverse content densities and layouts. The results show that our network can predict plausible design orders for various types of visual notes and that our approach can effectively optimize their layouts so that they are easier to follow.

4.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3329-3346, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35984803

ABSTRACT

Glass is very common in our daily life. Existing computer vision systems neglect it, which can have severe consequences, e.g., a robot may crash into a glass wall. However, sensing the presence of glass is not straightforward. The key challenge is that arbitrary objects/scenes can appear behind the glass. In this paper, we raise an important problem: detecting glass surfaces from a single RGB image. To address this problem, we construct the first large-scale glass detection dataset (GDD) and propose a novel glass detection network, called GDNet-B, which explores abundant contextual cues over a large field-of-view via a novel large-field contextual feature integration (LCFI) module and integrates both high-level and low-level boundary features with a boundary feature enhancement (BFE) module. Extensive experiments demonstrate that GDNet-B achieves satisfactory glass detection results on images both within and beyond the GDD test set. We further validate the effectiveness and generalization capability of GDNet-B by applying it to other vision tasks, including mirror segmentation and salient object detection. Finally, we show potential applications of glass detection and discuss possible future research directions.

5.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3492-3504, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35687623

ABSTRACT

Mirror detection is challenging because the visual appearance of a mirror changes with that of its surroundings. As existing mirror detection methods mainly rely on extracting contextual contrast and relational similarity between mirror and non-mirror regions, they may fail to identify a mirror region when these assumptions are violated. Inspired by a recent study that applies a CNN to distinguish whether an image is flipped based on the visual chirality property, in this paper we rethink this image-level visual chirality property and reformulate it as a learnable pixel-level cue for mirror detection. Specifically, we first propose a novel flipping-convolution-flipping (FCF) transformation to model visual chirality as a learnable commutative residual. We then propose a novel visual chirality embedding (VCE) module that exploits this commutative residual in multi-scale feature maps to embed the visual chirality features into our mirror detection model. In addition, we propose a visual chirality-guided edge detection (CED) module to integrate the visual chirality features with contextual features for detection refinement. Extensive experiments show that the proposed method outperforms state-of-the-art methods on three benchmark datasets.
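A minimal sketch of the flip-convolution-flip idea described above, assuming a PyTorch-style layer (an illustration, not the published FCF module): the same convolution is applied directly and on a horizontally flipped input (flipped back afterwards), and the difference serves as a learnable, pixel-level commutative residual.

```python
# Illustrative flip-conv-flip residual: non-zero wherever the learned filters
# respond chirally (i.e., where flipping and convolution do not commute).
import torch
import torch.nn as nn

class FlipConvFlipResidual(nn.Module):
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        direct = self.conv(x)                                                # conv(x)
        mirrored = torch.flip(self.conv(torch.flip(x, dims=[3])), dims=[3])  # flip(conv(flip(x)))
        return direct - mirrored                                             # commutative residual

layer = FlipConvFlipResidual()
x = torch.rand(2, 3, 64, 64)
print(layer(x).shape)  # torch.Size([2, 16, 64, 64])
```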

6.
IEEE Trans Image Process ; 31: 6295-6305, 2022.
Article in English | MEDLINE | ID: mdl-36149997

ABSTRACT

Most existing salient object detection (SOD) methods are designed for RGB images and do not take advantage of the abundant information provided by light fields. Hence, they may fail to detect salient objects with complex structures and to delineate their boundaries. Although some methods have explored the multi-view information of light field images for saliency detection, they require tedious pixel-level manual annotations of ground truths. In this paper, we propose a novel weakly-supervised learning framework for salient object detection on light field images based on bounding box annotations. Our method has two major novelties. First, given an input light field image and a bounding-box annotation indicating the salient object, we propose a ground truth label hallucination method to generate a pixel-level pseudo saliency map, avoiding the heavy cost of pixel-level annotations. This method generates high-quality pseudo ground truth saliency maps to help supervise the training, by exploiting information obtained from the light field (including depths and RGB images). Second, to exploit the multi-view nature of light field data in learning, we propose a fusion attention module to calibrate the spatial and channel-wise light field representations. It learns to focus on informative features and suppress redundant information from the multi-view inputs. Based on these two novelties, we are able to train a new salient object detector with two branches in a weakly-supervised manner. While the RGB branch focuses on modeling the color contrast in the all-in-focus image for locating the salient objects, the Focal branch exploits the depth and the background spatial redundancy of focal slices for eliminating background distractions. Extensive experiments show that our method outperforms existing weakly-supervised methods and most fully supervised methods.
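As an illustration of the fusion attention idea above, the following is a hedged sketch (the paper's exact design may differ): fused multi-view light-field features are recalibrated first channel-wise and then spatially, so informative responses are emphasized and redundant ones suppressed.

```python
# Illustrative channel-then-spatial recalibration of fused light-field features.
import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_gate(x)   # channel-wise recalibration
        x = x * self.spatial_gate(x)   # spatial recalibration
        return x

feat = torch.rand(1, 32, 48, 48)       # fused multi-view features (assumed layout)
print(FusionAttention(32)(feat).shape)  # torch.Size([1, 32, 48, 48])
```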

7.
IEEE Trans Pattern Anal Mach Intell ; 44(9): 5280-5292, 2022 09.
Article in English | MEDLINE | ID: mdl-33905322

ABSTRACT

Contextual information plays an important role in solving various image and scene understanding tasks. Prior works have focused on extracting contextual information from an image and using it to infer the properties of some object(s) in the image or to understand the scene behind the image, e.g., context-based object detection, recognition, and semantic segmentation. In this paper, we consider an inverse problem, i.e., how to hallucinate missing contextual information from the properties of standalone objects. We refer to this as object-level scene context prediction. The problem is difficult, as it requires extensive knowledge of the complex and diverse relationships among objects in a scene. We propose a deep neural network that takes as input the properties (i.e., category, shape, and position) of a few standalone objects and predicts an object-level scene layout that compactly encodes the semantics and structure of the scene context in which the given objects reside. Quantitative experiments and user studies demonstrate that our model can generate more plausible scene contexts than the baselines. Our model also enables the synthesis of realistic scene images from partial scene layouts. Finally, we validate that our model internally learns useful features for scene recognition and fake scene detection.


Subject(s)
Algorithms, Semantics
8.
IEEE Trans Image Process ; 30: 9085-9098, 2021.
Article in English | MEDLINE | ID: mdl-34705644

ABSTRACT

Although huge progress has been made on scene analysis in recent years, most existing works assume that the input images are taken in the daytime with good lighting conditions. In this work, we aim to address the night-time scene parsing (NTSP) problem, which has two main challenges: 1) labeled night-time data are scarce, and 2) over- and under-exposures may co-occur in input night-time images and are not explicitly modeled in existing pipelines. To tackle the scarcity of night-time data, we collect a novel labeled dataset, named NightCity, of 4,297 real night-time images with ground truth pixel-level semantic annotations. To our knowledge, NightCity is the largest dataset for NTSP. In addition, we propose an exposure-aware framework that addresses the NTSP problem by augmenting the segmentation process with explicitly learned exposure features. Extensive experiments show that training on NightCity can significantly improve NTSP performance and that our exposure-aware model outperforms the state-of-the-art methods, yielding top performance on our dataset as well as on existing datasets.

9.
IEEE Trans Image Process ; 30: 8497-8509, 2021.
Article in English | MEDLINE | ID: mdl-34623268

ABSTRACT

Rain degrades image visual quality and disrupts object structures, obscuring their details and erasing their colors. Existing deraining methods are primarily based on modeling either the visual appearance of rain or its physical characteristics (e.g., rain direction and density), and thus suffer from two common problems. First, due to the stochastic nature of rain, they tend to fail to recognize rain streaks correctly and wrongly remove image structures and details. Second, they fail to recover the image colors erased by heavy rain. In this paper, we address these two problems with the following three contributions. First, we propose a novel PHP block to aggregate comprehensive spatial and hierarchical information for removing rain streaks of different sizes. Second, we propose a novel network that first removes rain streaks, then recovers object structures/colors, and finally enhances details. Third, to train the network, we prepare a new dataset and propose a novel loss function that introduces semantic and color regularization for deraining. Extensive experiments demonstrate the superiority of the proposed method over state-of-the-art deraining methods on both synthesized and real-world data, in terms of visual quality, quantitative accuracy, and running speed.

10.
IEEE Trans Image Process ; 30: 4423-4435, 2021.
Article in English | MEDLINE | ID: mdl-33856991

ABSTRACT

In this paper, we propose a novel form of weak supervision for salient object detection (SOD) based on saliency bounding boxes, which are minimum rectangular boxes enclosing the salient objects. Based on this idea, we propose a novel weakly-supervised SOD method that predicts pixel-level pseudo ground truth saliency maps from just saliency bounding boxes. Our method first takes advantage of unsupervised SOD methods to generate initial saliency maps and addresses the over-/under-prediction problems to obtain initial pseudo ground truth saliency maps. We then iteratively refine the initial pseudo ground truths by learning a multi-task map refinement network with saliency bounding boxes. Finally, the resulting pseudo saliency maps are used to supervise the training of a salient object detector. Experimental results show that our method outperforms state-of-the-art weakly-supervised methods.
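A minimal sketch of how a saliency bounding box can be turned into a pixel-level pseudo ground truth, under the assumption (ours, for illustration) that an unsupervised saliency map is simply restricted to the box and thresholded; the paper's actual hallucination and iterative refinement steps are richer.

```python
# Illustrative pseudo ground truth from a bounding box: suppress predictions
# outside the box, then binarize what remains inside it.
import numpy as np

def pseudo_gt_from_bbox(unsup_saliency, bbox, threshold=0.5):
    """unsup_saliency: HxW float array in [0, 1]; bbox: (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    box_mask = np.zeros_like(unsup_saliency)
    box_mask[y0:y1, x0:x1] = 1.0            # everything outside the box is background
    masked = unsup_saliency * box_mask      # remove over-predictions outside the box
    return (masked >= threshold).astype(np.float32)

saliency = np.random.rand(240, 320).astype(np.float32)
pseudo = pseudo_gt_from_bbox(saliency, bbox=(80, 60, 240, 180))
print(pseudo.shape, pseudo.max())
```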

11.
IEEE Trans Image Process ; 30: 3885-3896, 2021.
Article in English | MEDLINE | ID: mdl-33764875

ABSTRACT

Synthesizing high dynamic range (HDR) images from multiple low dynamic range (LDR) exposures in dynamic scenes is challenging. There are two major problems caused by the large motions of foreground objects. One is the severe misalignment among the LDR images. The other is the missing content due to the over-/under-saturated regions caused by the moving objects, which may not be easily compensated for by the multiple LDR exposures. The HDR generation model is therefore required to properly fuse the LDR images and restore the missing details without introducing artifacts. To address these two problems, we propose in this paper a novel GAN-based model, HDR-GAN, for synthesizing HDR images from multi-exposure LDR images. To the best of our knowledge, this is the first GAN-based approach for fusing multi-exposure LDR images for HDR reconstruction. By incorporating adversarial learning, our method is able to produce faithful information in regions with missing content. In addition, we propose a novel generator network, with a reference-based residual merging block for aligning large object motions in the feature domain, and a deep HDR supervision scheme for eliminating artifacts in the reconstructed HDR images. Experimental results demonstrate that our model achieves state-of-the-art reconstruction performance over prior HDR methods on diverse scenes.
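The following is an assumed, illustrative sketch of reference-based residual merging in the feature domain (not the published block): features from non-reference exposures contribute residuals that are added onto the reference features rather than being fused symmetrically.

```python
# Illustrative reference-based residual merging of multi-exposure features.
import torch
import torch.nn as nn

class ReferenceResidualMerge(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.residual_pred = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, ref_feat, other_feats):
        merged = ref_feat
        for feat in other_feats:
            # Predict what this exposure adds relative to the reference, then merge it.
            merged = merged + self.residual_pred(torch.cat([ref_feat, feat], dim=1))
        return merged

ref = torch.rand(1, 32, 64, 64)
others = [torch.rand(1, 32, 64, 64) for _ in range(2)]
print(ReferenceResidualMerge()(ref, others).shape)  # torch.Size([1, 32, 64, 64])
```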

12.
Article in English | MEDLINE | ID: mdl-32365031

ABSTRACT

In this paper, we propose a deep CNN that tackles the image restoration problem by learning formatted information. Previous deep learning-based methods directly learn the mapping from corrupted images to clean images and may suffer from the gradient exploding/vanishing problems of deep neural networks. We propose instead to address image restoration by learning the structured details and recovering the latent clean image together, from the information shared between the corrupted image and the latent image. In addition, instead of learning the pure difference (corruption), we add a residual formatting layer and an adversarial block that format the information into a structured form, which allows the network to converge faster and boosts performance. Furthermore, we propose a cross-level loss net to ensure both pixel-level accuracy and semantic-level visual quality. Evaluations on public datasets show that the proposed method performs favorably against existing approaches both quantitatively and qualitatively.

13.
Article in English | MEDLINE | ID: mdl-31603784

ABSTRACT

In this paper, we propose a novel method to generate stereoscopic images from light-field images with an intended depth range while simultaneously performing image super-resolution. Owing to the small baseline between neighboring subaperture views and the low spatial resolution of light-field images captured with compact commercial light-field cameras, the disparity range of any two subaperture views is usually very small. We propose a method to control the disparity range of the target stereoscopic images with linear or nonlinear disparity scaling, and properly resolve the disocclusion problem with the aid of a smooth energy term previously used for texture synthesis. The left and right views of the target stereoscopic image are generated simultaneously by a unified optimization framework, which preserves content coherence between the left and right views through a coherence energy term. The disparity range of the target stereoscopic image can be larger than that of the input light-field image. This benefits many light-field-based applications, e.g., displaying light-field images on various stereo display devices and generating stereoscopic panoramic images from a light-field image montage. An extensive experimental evaluation demonstrates the effectiveness of our method.
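A small sketch of the disparity-range control described above; the scaling functions here (linear, or a gamma-style nonlinear curve) are assumptions for illustration, not the paper's exact mappings.

```python
# Illustrative disparity scaling: remap a narrow light-field disparity range
# to a target stereoscopic range, linearly (gamma=1) or nonlinearly.
import numpy as np

def scale_disparity(disp, target_min, target_max, gamma=1.0):
    """disp: HxW disparity map; gamma=1.0 gives linear scaling, otherwise nonlinear."""
    d_min, d_max = disp.min(), disp.max()
    normalized = (disp - d_min) / max(d_max - d_min, 1e-8)   # map to [0, 1]
    normalized = normalized ** gamma                          # optional nonlinear remapping
    return target_min + normalized * (target_max - target_min)

disp = np.random.rand(256, 256).astype(np.float32) * 2.0      # small input range
stereo_disp = scale_disparity(disp, target_min=-15.0, target_max=15.0, gamma=0.8)
print(stereo_disp.min(), stereo_disp.max())
```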

14.
IEEE Trans Image Process ; 28(8): 3766-3777, 2019 Aug.
Article in English | MEDLINE | ID: mdl-30843833

ABSTRACT

The tracking-by-detection framework has received growing attention through its integration with convolutional neural networks (CNNs). Existing tracking-by-detection methods, however, fail to track objects with severe appearance variations. This is because the traditional convolution operation is performed on fixed grids and thus may not be able to find the correct response while the object is changing pose or under varying environmental conditions. In this paper, we propose a deformable convolution layer to enrich the target appearance representations in the tracking-by-detection framework. We aim to capture target appearance variations via deformable convolution, which adaptively enhances the original features. In addition, we propose a gated fusion scheme to control how the variations captured by the deformable convolution affect the original appearance. The enriched feature representation through deformable convolution facilitates the discrimination of the CNN classifier between the target object and the background. Extensive experiments on standard benchmarks show that the proposed tracker performs favorably against state-of-the-art methods.
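A hedged sketch of the gated fusion idea, using torchvision's DeformConv2d for the deformable branch; channel sizes and the gate design are illustrative assumptions rather than the tracker's actual architecture.

```python
# Illustrative gated fusion of original features with deformable-conv features.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class GatedDeformableFusion(nn.Module):
    def __init__(self, channels=32, k=3):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)   # sampling offsets
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)
        self.gate = nn.Sequential(nn.Conv2d(channels * 2, channels, 1), nn.Sigmoid())

    def forward(self, x):
        deformed = self.deform(x, self.offset(x))           # appearance-adaptive features
        g = self.gate(torch.cat([x, deformed], dim=1))      # how much variation to admit
        return g * deformed + (1.0 - g) * x                 # gated fusion with original features

feat = torch.rand(1, 32, 40, 40)
print(GatedDeformableFusion()(feat).shape)  # torch.Size([1, 32, 40, 40])
```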

15.
IEEE Trans Image Process ; 27(2): 764-777, 2018 Feb.
Article in English | MEDLINE | ID: mdl-29757730

ABSTRACT

We present an approach to localize generic actions in egocentric videos, called temporal action proposals (TAPs), for accelerating the action recognition step. An egocentric TAP refers to a sequence of frames that may contain a generic action performed by the wearer of a head-mounted camera, e.g., taking a knife, spreading jam, pouring milk, or cutting carrots. Inspired by object proposals, this paper aims at generating a small number of TAPs, thereby replacing the popular sliding window strategy, for localizing all action events in the input video. To this end, we first propose to temporally segment the input video into action atoms, which are the smallest units that may contain an action. We then apply a hierarchical clustering algorithm with several egocentric cues to generate TAPs. Finally, we propose two actionness networks to score the likelihood of each TAP containing an action. The top ranked candidates are returned as output TAPs. Experimental results show that the proposed TAP detection framework performs significantly better than relevant approaches for egocentric action detection.

16.
IEEE Trans Cybern ; 48(5): 1406-1419, 2018 May.
Article in English | MEDLINE | ID: mdl-28475073

ABSTRACT

Video decolorization filters out color information while preserving the perceivable content of the video as completely and accurately as possible. Existing methods mainly apply image decolorization strategies to videos, which can be slow and produce incoherent results. In this paper, we propose a video decolorization framework that considers frame coherence and saves decolorization time by referring to previously decolorized frames. It has three main contributions. First, we define a decolorization proximity to measure the similarity of adjacent frames. Second, we propose three decolorization strategies for frames with low, medium, and high proximity, to preserve the quality of these three types of frames. Third, we propose a novel decolorization Gaussian mixture model to classify the frames and assign the appropriate decolorization strategy to each based on its decolorization proximity. To evaluate our results, we measure them from three aspects: 1) qualitative, 2) quantitative, and 3) user study. We apply the color contrast preserving ratio and C2G-SSIM to evaluate the quality of single-frame decolorization. We propose a novel temporal coherence degree metric to evaluate the temporal coherence of the decolorized video. Compared with current methods, the proposed approach shows all-around better performance in time efficiency, temporal coherence, and quality preservation.
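A minimal sketch of the proximity-based routing described above, with an assumed proximity measure and thresholds (the paper classifies frames with a decolorization Gaussian mixture model): very similar adjacent frames reuse previous results, while dissimilar frames get a full decolorization.

```python
# Illustrative routing of frames to cheap/partial/full decolorization strategies.
import numpy as np
import cv2

def decolorization_proximity(prev_frame, cur_frame):
    """Mean color similarity of adjacent frames, in [0, 1] (1 = identical)."""
    diff = np.abs(cur_frame.astype(np.float32) - prev_frame.astype(np.float32)) / 255.0
    return 1.0 - float(diff.mean())

def decolorize_frame(prev_gray, prev_frame, cur_frame, hi=0.98, lo=0.90):
    p = decolorization_proximity(prev_frame, cur_frame)
    if p >= hi:
        return prev_gray                                    # high proximity: reuse previous result
    if p >= lo:
        return cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)  # medium: cheap luminance conversion
    gray, _ = cv2.decolor(cur_frame)                        # low: full contrast-preserving decolorization
    return gray

prev = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
cur = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
print(decolorize_frame(prev_gray, prev, cur).shape)
```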

17.
IEEE Trans Image Process ; 27(3): 1076-1085, 2018 03.
Article in English | MEDLINE | ID: mdl-29220312

ABSTRACT

In this paper, we propose a novel L0-regularized optimization framework for image downscaling. The optimization is driven by two L0-regularized priors. The first, the gradient-ratio prior, is based on the observation that the number of edges in the downscaled image is approximately inversely proportional to the square of the downscaling factor. By introducing L0-norm sparsity to the gradient ratio, the downscaled image is able to preserve the most salient edges as well as the visual perception of the original image. The second, the downsampling prior, constrains the downsampling matrix so that pixels of the downscaled image are estimated from their optimal neighboring pixels. Extensive experiments on the Urban100 and BSDS500 datasets show that the proposed algorithm achieves superior performance over the state of the art, in terms of both quality and robustness.
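The gradient-ratio observation above can be written compactly as follows (notation assumed for illustration): with input image I, downscaled image D, downscaling factor s, and the L0 "norm" counting non-zero gradients (edges),

```latex
% Schematic statement of the gradient-ratio prior (symbols are assumptions):
% the downscaled image D should keep roughly 1/s^2 of the input image I's edges,
% where \|\nabla X\|_0 counts the non-zero gradient magnitudes of X.
\|\nabla D\|_{0} \;\approx\; \frac{1}{s^{2}}\,\|\nabla I\|_{0}
```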

18.
IEEE Trans Image Process ; 26(10): 4991-5004, 2017 Oct.
Article in English | MEDLINE | ID: mdl-28742037

ABSTRACT

Non-uniform motion blur due to object movement or camera jitter is a common phenomenon in videos. However, state-of-the-art video deblurring methods for this problem can introduce artifacts and may fail to handle motion blur caused by object or camera movement. In this paper, we propose a non-uniform motion model to deblur video frames. The proposed method is based on superpixel matching in the video sequence to reconstruct sharp frames from blurry ones. To identify a suitable sharp superpixel to replace a blurry one, we enrich the search space with a non-uniform motion blur kernel and use a generalized PatchMatch algorithm to handle rotation, scale, and blur differences in the matching step. Instead of using a pixel-based or regular patch-based representation, we adopt a superpixel-based representation and use color and motion to gather similar pixels. Our non-uniform motion blur kernels are estimated from the motion field of these superpixels, and our spatially varying motion model considers spatial and temporal coherence to find sharp superpixels. Experimental results show that the proposed method can reconstruct sharp video frames from frames blurred by complex object and camera movements, and performs better than state-of-the-art methods.

19.
IEEE Trans Image Process ; 26(2): 671-683, 2017 Feb.
Article in English | MEDLINE | ID: mdl-27849541

ABSTRACT

Object proposal detection is an effective way of accelerating object recognition. Existing proposal methods are mostly based on detecting object boundaries, which may not be effective for cluttered backgrounds. In this paper, we leverage stereopsis as a robust and effective solution for generating object proposals. We first obtain a set of candidate bounding boxes through an adaptive transformation, which fits the bounding boxes tightly to object boundaries detected from rough depth and color information. A two-level hierarchy composed of proposal and cluster levels is then constructed to estimate object locations in an efficient and accurate manner. Three stereo-based cues, "exactness," "focus," and "distribution," are proposed for objectness estimation. Two-level hierarchical ranking is proposed to accurately obtain ranked object proposals. A stereo dataset with 400 labeled stereo image pairs is constructed to evaluate the performance of the proposed method in both indoor and outdoor scenes. Extensive experimental evaluations show that the proposed stereo-based approach achieves better performance than the state of the art with either a small or a large number of object proposals. As stereopsis complements color information, the proposed method can be integrated with existing proposal methods to obtain superior results.

20.
IEEE Comput Graph Appl ; 36(1): 52-61, 2016.
Article in English | MEDLINE | ID: mdl-25137724

ABSTRACT

Turntable-based 3D scanners are popular but require calibration of the turntable axis. Existing methods for turntable calibration typically make use of specially designed tools, such as a chessboard or criterion sphere, which users must manually install and dismount. In this article, the authors propose an automatic method to calibrate the turntable axis without any calibration tools. Given a scan sequence of the input object, they first recover the initial rotation axis from an automatic registration step. Then they apply an iterative procedure to obtain the optimized turntable axis. This iterative procedure alternates between two steps: refining the initial pose of the input scans and approximating the rotation matrix. The performance of the proposed method was evaluated on a structured light-based scanning system.
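As an illustration of one ingredient of the calibration, here is a small sketch (not the authors' pipeline) that recovers an initial turntable axis from the relative rotations produced by pairwise registration of consecutive scans; the iterative procedure described above would then refine the scan poses and the axis alternately.

```python
# Illustrative initial-axis estimation: each relative rotation's axis is its
# eigenvector for eigenvalue 1; averaging over scan pairs gives an initial axis.
import numpy as np

def rotation_axis(R):
    """Unit rotation axis of a 3x3 rotation matrix."""
    eigvals, eigvecs = np.linalg.eig(R)
    axis = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return axis / np.linalg.norm(axis)

def initial_turntable_axis(relative_rotations):
    axes, ref = [], None
    for R in relative_rotations:
        a = rotation_axis(R)
        if ref is None:
            ref = a
        if np.dot(a, ref) < 0:          # keep a consistent sign across pairs
            a = -a
        axes.append(a)
    mean_axis = np.mean(axes, axis=0)
    return mean_axis / np.linalg.norm(mean_axis)

# Example: noisy rotations about the z-axis, as a turntable would produce.
def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

Rs = [rot_z(np.deg2rad(15.0 + np.random.randn() * 0.1)) for _ in range(10)]
print(initial_turntable_axis(Rs))   # approximately [0, 0, 1]
```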
