Results 1 - 20 of 35
1.
Article in English | MEDLINE | ID: mdl-33596172

ABSTRACT

Image segmentation is a key task in computer vision and image processing, with important applications in scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among others; accordingly, numerous segmentation algorithms are found in the literature. Against this backdrop, the broad success of Deep Learning (DL) has prompted the development of new image segmentation approaches that leverage DL models. We provide a comprehensive review of this recent literature, covering the spectrum of pioneering efforts in semantic and instance segmentation, including convolutional pixel-labeling networks, encoder-decoder architectures, multiscale and pyramid-based approaches, recurrent networks, visual attention models, and generative models in adversarial settings. We investigate the relationships, strengths, and challenges of these DL-based segmentation models, examine the widely used datasets, compare performances, and discuss promising research directions.
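
For readers unfamiliar with the encoder-decoder family this survey covers, the following minimal PyTorch sketch shows the canonical structure: a downsampling encoder followed by an upsampling decoder that emits per-pixel class scores. All layer sizes are illustrative assumptions, not drawn from any specific reviewed model.

```python
import torch.nn as nn

class MiniSegNet(nn.Module):
    """A bare-bones encoder-decoder: downsample to a coarse representation,
    then upsample back to per-pixel class scores. Layer sizes are illustrative."""
    def __init__(self, n_classes=21):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # H/2
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))  # H/4
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),           # H/2
            nn.ConvTranspose2d(32, 32, 2, stride=2), nn.ReLU(),           # H
            nn.Conv2d(32, n_classes, 1))       # 1x1 conv -> class scores per pixel

    def forward(self, x):                      # x: (B, 3, H, W), H and W divisible by 4
        return self.decoder(self.encoder(x))   # (B, n_classes, H, W)
```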

2.
IEEE Trans Pattern Anal Mach Intell ; 42(9): 2148-2164, 2020 Sep.
Article in English | MEDLINE | ID: mdl-31056489

ABSTRACT

In popular TV programs (such as CSI), a very low-resolution face image of a person, who in many cases is not even looking at the camera, is digitally super-resolved to the point that the person's identity suddenly becomes visible and recognizable. Of course, we suspect that this is merely a cinematographic special effect and that such a magical transformation of a single image is not technically possible. Or is it? In this paper, we push the boundaries of super-resolving (hallucinating, to be more accurate) a tiny, non-frontal face image to understand how much of this is possible by leveraging large datasets and deep networks. To this end, we introduce a novel Transformative Adversarial Neural Network (TANN) that jointly frontalizes very low-resolution (i.e., 16 × 16 pixel) out-of-plane rotated face images (including profile views) and aggressively super-resolves them (8×), regardless of their original poses and without using any 3D information. TANN is composed of two components: a transformative upsampling network, which embodies encoding, spatial transformation, and deconvolutional layers, and a discriminative network, which constrains the generated high-resolution frontal faces to lie on the same manifold as real frontal face images. We evaluate our method on a large set of synthesized non-frontal face images to assess its reconstruction performance. Extensive experiments demonstrate that TANN generates both qualitatively and quantitatively superior results, achieving an improvement of over 4 dB over the state of the art.
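
The following PyTorch sketch illustrates the two-component layout the abstract describes, under assumed layer sizes: an upsampling network that encodes a 16 × 16 input and deconvolves it to 128 × 128 (8×), and a discriminator that scores whether an image looks like a real frontal face. The paper's spatial transformation layers are reduced to a comment, so this is a structural sketch only.

```python
import torch
import torch.nn as nn

class TransformativeUpsampler(nn.Module):
    """Encode a 16x16 face and deconvolve it to 128x128 (8x). The spatial
    transformation layers that frontalize the face are only indicated."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        # ... spatial transformation layers would warp features toward a frontal pose ...
        self.upsample = nn.Sequential(            # 16 -> 32 -> 64 -> 128
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, lr):                        # lr: (B, 3, 16, 16)
        return self.upsample(self.encode(lr))     # (B, 3, 128, 128)

class FrontalFaceDiscriminator(nn.Module):
    """Scores whether an image lies on the real frontal-face manifold."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1))

    def forward(self, img):
        return self.net(img)                      # real/fake logit
```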

3.
IEEE Trans Pattern Anal Mach Intell ; 42(11): 2926-2943, 2020 Nov.
Article in English | MEDLINE | ID: mdl-31095477

ABSTRACT

Given a tiny face image, existing face hallucination methods aim to super-resolve its high-resolution (HR) counterpart by learning a mapping from an exemplary dataset. Since a low-resolution (LR) input patch may correspond to many HR candidate patches, this ambiguity can lead to distorted HR facial details and wrong attributes, such as gender reversal and rejuvenation. An LR input contains the low-frequency facial components of its HR version, while its residual face image, defined as the difference between the HR ground truth and the interpolated LR image, contains the missing high-frequency facial details. We demonstrate that supplementing residual images or feature maps with additional facial attribute information significantly reduces the ambiguity in face super-resolution. To explore this idea, we develop an attribute-embedded upsampling network consisting of an upsampling network and a discriminative network. The upsampling network is composed of an autoencoder with skip connections, which incorporates facial attribute vectors into the residual features of LR inputs at its bottleneck, followed by deconvolutional layers for upsampling. The discriminative network examines whether super-resolved faces contain the desired attributes, and its loss is used to update the upsampling network. In this manner, we can super-resolve tiny (16 × 16 pixel) unaligned face images with a large upscaling factor of 8× while remarkably reducing the uncertainty of one-to-many mappings. Extensive evaluations on a large-scale dataset demonstrate that our method achieves superior face hallucination results and outperforms the state of the art.
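
A minimal sketch of the bottleneck fusion step described above: the facial attribute vector is tiled spatially and concatenated with the residual feature maps, then merged by a 1 × 1 convolution. The channel count and number of attributes are assumptions.

```python
import torch
import torch.nn as nn

class AttributeBottleneck(nn.Module):
    """Tile the facial attribute vector spatially and concatenate it with the
    residual feature maps at the autoencoder bottleneck, then merge with a
    1x1 convolution. feat_ch and n_attrs are assumptions."""
    def __init__(self, feat_ch=128, n_attrs=18):
        super().__init__()
        self.fuse = nn.Conv2d(feat_ch + n_attrs, feat_ch, kernel_size=1)

    def forward(self, feats, attrs):       # feats: (B, C, H, W); attrs: (B, n_attrs)
        b, _, h, w = feats.shape
        a = attrs.view(b, -1, 1, 1).expand(-1, -1, h, w)   # broadcast over space
        return self.fuse(torch.cat([feats, a], dim=1))
```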

4.
Article in English | MEDLINE | ID: mdl-31796388

ABSTRACT

Hyperparameters are numerical settings whose values are fixed before the learning process begins. Selecting appropriate hyperparameters is often critical for achieving satisfactory performance in many vision problems, such as deep learning-based visual object tracking. Yet it is difficult to determine their optimal values, in particular values adapted to each specific video input. Most hyperparameter optimization algorithms search a generic range and impose the result blindly on all sequences. In this paper, we propose a novel dynamical hyperparameter optimization method that adaptively optimizes hyperparameters for a given sequence using an action-prediction network built on continuous deep Q-learning. Since the observation space for visual object tracking is significantly more complex than those of traditional control problems, existing continuous deep Q-learning algorithms cannot be applied directly. To overcome this challenge, we introduce an efficient heuristic strategy that handles the high-dimensional state space while accelerating convergence. The proposed algorithm is applied to improve two representative trackers, a Siamese-based one and a correlation-filter-based one, to evaluate its generality. Their superior performance on several popular benchmarks is clearly demonstrated.
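
As a rough illustration of the action-prediction idea, the sketch below maps a compact state descriptor to continuous hyperparameter values, each squashed into a plausible range. The state dimensionality and the ranges are invented for illustration; the paper's network design and Q-learning machinery are not reproduced here.

```python
import torch
import torch.nn as nn

class ActionPredictionNet(nn.Module):
    """Map a compact state descriptor of the current sequence to continuous
    hyperparameter values. state_dim and the (low, high) ranges below are
    illustrative assumptions, not the paper's settings."""
    def __init__(self, state_dim=32, ranges=((0.1, 1.0), (0.001, 0.1))):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, len(ranges)))
        lo, hi = zip(*ranges)
        self.register_buffer("lo", torch.tensor(lo))
        self.register_buffer("hi", torch.tensor(hi))

    def forward(self, state):                 # state: (B, state_dim)
        raw = torch.sigmoid(self.mlp(state))  # each action squashed into (0, 1)
        return self.lo + raw * (self.hi - self.lo)  # rescale into its range
```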

5.
Article in English | MEDLINE | ID: mdl-31613765

ABSTRACT

Stereo videos of dynamic scenes often show unpleasant blur caused by camera motion and by multiple moving objects with large depth variations. Given consecutive blurred stereo video frames, we aim to recover the latent clean images, estimate the 3D scene flow, and segment the multiple moving objects. These three tasks have previously been addressed separately, which fails to exploit their internal connections and cannot achieve optimality. In this paper, we propose to solve all three tasks jointly in a unified framework by exploiting their intrinsic connections. To this end, we represent dynamic scenes with a piecewise planar model, which exploits the local structure of the scene and can express a wide variety of dynamic scenes. Under our model, the three tasks are naturally connected and expressed as the estimation of the 3D scene structure and camera motion parameters (structure and motion for dynamic scenes). By exploiting the blur model constraint, the moving objects, and the 3D scene structure, we arrive at an energy minimization formulation for joint deblurring, scene flow estimation, and segmentation. We evaluate our approach extensively on both synthetic datasets and publicly available real datasets featuring fast-moving objects, camera motion, uncontrolled lighting conditions, and shadows. Experimental results demonstrate that our method achieves significant improvements in stereo video deblurring, scene flow estimation, and moving object segmentation over state-of-the-art methods.

6.
IEEE Trans Image Process ; 28(10): 4819-4831, 2019 Oct.
Article in English | MEDLINE | ID: mdl-31059438

ABSTRACT

Video saliency detection aims to continuously discover motion-related salient objects in video sequences. Since it must jointly satisfy spatial and temporal constraints, video saliency detection is more challenging than image saliency detection. In this paper, we propose a new method to detect salient objects in video based on sparse reconstruction and propagation. With the assistance of novel static and motion priors, a single-frame saliency model is first designed to represent the spatial saliency of each individual frame via sparsity-based reconstruction. Then, through progressive sparsity-based propagation, the sequential correspondence in the temporal domain is captured to produce an inter-frame saliency map. Finally, these two maps are incorporated into a global optimization model to achieve spatio-temporal smoothness and global consistency of the salient object across the whole video. Experiments on three large-scale video saliency datasets demonstrate that the proposed method outperforms state-of-the-art algorithms both qualitatively and quantitatively.
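
The single-frame component can be pictured as follows: superpixels that are poorly reconstructed by a background dictionary receive high saliency. This sketch uses scikit-learn's orthogonal matching pursuit as a stand-in sparse coder, and the assumption that boundary superpixels form the background dictionary mirrors a common prior rather than the paper's exact construction.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def reconstruction_saliency(features, bg_features, n_nonzero=8):
    """features: (n_superpixels, d) descriptors of one frame; bg_features:
    (n_bg, d) descriptors of superpixels assumed to be background (e.g., the
    frame boundary). Saliency = error of sparse reconstruction from the
    background dictionary: what the background cannot explain is salient."""
    D = bg_features.T / (np.linalg.norm(bg_features, axis=1) + 1e-12)  # unit columns
    sal = np.empty(len(features))
    for i, f in enumerate(features):
        code = orthogonal_mp(D, f, n_nonzero_coefs=n_nonzero)
        sal[i] = np.linalg.norm(f - D @ code)
    return sal / (sal.max() + 1e-12)    # normalize to [0, 1]
```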

7.
IEEE Trans Image Process ; 28(7): 3516-3527, 2019 Jul.
Article in English | MEDLINE | ID: mdl-30762546

ABSTRACT

In the same vein as discriminative one-shot learning, Siamese networks can recognize an object from a single exemplar with the same class label. However, they do not take advantage of the underlying structure of the data or the relationships among the multitude of samples, since they rely only on pairs of instances for training. In this paper, we propose a new quadruplet deep network that examines the potential connections among the training instances, aiming to achieve a more powerful representation. We design a shared network with four branches that receive a multi-tuple of instances as input and are connected by a novel loss function combining a pair loss and a triplet loss. According to the similarity metric, we select the most similar and the most dissimilar instances from each multi-tuple as the positive and negative inputs of the triplet loss. We show that this scheme improves training performance. Furthermore, we introduce a new weight layer that automatically selects suitable combination weights, avoiding the conflict between the triplet and pair losses that would otherwise degrade performance. We evaluate our quadruplet framework by model-free tracking-by-detection of objects from a single initial exemplar on several visual object tracking benchmarks. Our extensive experimental analysis demonstrates that our tracker achieves superior performance at a real-time processing speed of 78 frames/s. Our source code is available.
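
A minimal PyTorch sketch of the combined loss described above: a pair (contrastive-style) term plus a triplet term, mixed by learnable weights. The softmax weighting is an assumed stand-in for the paper's weight layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuadrupletLoss(nn.Module):
    """Combine a pair (contrastive-style) term with a triplet term using
    learnable weights (softmax keeps them positive and normalized -- an
    assumed stand-in for the paper's weight layer)."""
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin
        self.log_w = nn.Parameter(torch.zeros(2))

    def forward(self, anchor, positive, most_sim, most_dissim):
        # embeddings: (B, d) outputs of the four shared-weight branches
        pair = F.pairwise_distance(anchor, positive).pow(2).mean()
        triplet = F.triplet_margin_loss(anchor, most_sim, most_dissim,
                                        margin=self.margin)
        w = torch.softmax(self.log_w, dim=0)
        return w[0] * pair + w[1] * triplet
```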

8.
IEEE Trans Neural Netw Learn Syst ; 30(9): 2637-2649, 2019 Sep.
Article in English | MEDLINE | ID: mdl-30624228

ABSTRACT

In this paper, we propose a framework for approximately maximizing a quadratic submodular energy subject to a knapsack constraint, in order to solve certain computer vision problems. The proposed submodular maximization problem can be viewed as a generalization of the classic 0/1 knapsack problem. Importantly, maximization of our knapsack-constrained submodular energy function can be solved via dynamic programming. We further introduce a range-reduction step prior to dynamic programming, yielding a two-stage procedure for more efficient maximization. To demonstrate the effectiveness of the proposed energy function and its maximization algorithm, we apply it to two representative computer vision tasks: image segmentation and motion trajectory clustering. Experimental results on image segmentation demonstrate that our method outperforms the classic segmentation algorithms of graph cuts and random walks. Moreover, our framework achieves better performance than state-of-the-art methods on the motion trajectory clustering task.
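
For reference, the special case with no quadratic interaction terms reduces to the classic 0/1 knapsack dynamic program sketched below; the paper's quadratic submodular terms and range-reduction step are omitted.

```python
def knapsack_max(values, weights, capacity):
    """Maximize total value subject to total weight <= capacity, with a 0/1
    choice per item. dp[c] holds the best value achievable with weight budget c."""
    dp = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        for c in range(capacity, w - 1, -1):   # reverse scan: each item used once
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[capacity]

# e.g., knapsack_max([6, 10, 12], [1, 2, 3], 5) == 22 (take the last two items)
```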

9.
Neural Netw ; 110: 82-90, 2019 Feb.
Article in English | MEDLINE | ID: mdl-30504041

ABSTRACT

The big breakthrough on the ImageNet challenge in 2012 was partially due to the 'Dropout' technique used to avoid overfitting. Here, we introduce a new approach called 'Spectral Dropout' to improve the generalization ability of deep neural networks. We cast the proposed approach in the form of regular Convolutional Neural Network (CNN) weight layers using a decorrelation transform with fixed basis functions. Our spectral dropout method prevents overfitting by eliminating weak and 'noisy' Fourier-domain coefficients of the neural network activations, leading to remarkably better results than current regularization methods. Furthermore, the proposed method is very efficient owing to the fixed basis functions used for the spectral transformation. In particular, compared to Dropout and Drop-Connect, our method significantly speeds up network convergence during training (roughly ×2), with considerably higher neuron pruning rates (an increase of ∼30%). We demonstrate that spectral dropout can also be used in conjunction with other regularization approaches, yielding additional performance gains.
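
A rough PyTorch sketch of the idea: transform activations into the frequency domain, zero the weakest coefficients, and transform back. It uses an FFT rather than the paper's fixed decorrelation basis, and the per-sample magnitude threshold is an assumption.

```python
import torch
import torch.nn as nn

class SpectralDropout(nn.Module):
    """Zero out weak frequency-domain coefficients of the activations.
    Uses an FFT for simplicity; the paper's fixed decorrelation basis and
    thresholding details may differ."""
    def __init__(self, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio

    def forward(self, x):                      # x: (B, C, H, W)
        if not self.training:                  # like dropout, a no-op at test time
            return x
        spec = torch.fft.rfft2(x)              # frequency-domain coefficients
        mag = spec.abs().flatten(1)            # (B, N) coefficient magnitudes
        k = max(1, int(mag.size(1) * self.keep_ratio))
        # per-sample cutoff: the (N - k + 1)-th smallest magnitude
        cut = mag.kthvalue(mag.size(1) - k + 1, dim=1).values
        mask = spec.abs() >= cut.view(-1, 1, 1, 1)
        return torch.fft.irfft2(torch.where(mask, spec, torch.zeros_like(spec)),
                                s=x.shape[-2:])
```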


Subjects
Deep Learning; Neural Networks, Computer; Deep Learning/trends
10.
IEEE Trans Pattern Anal Mach Intell ; 41(9): 2112-2130, 2019 Sep.
Article in English | MEDLINE | ID: mdl-30004871

ABSTRACT

A fundamental problem in image deblurring is to reliably recover the distinct spatial frequencies that have been suppressed by the blur kernel. To tackle this issue, existing image deblurring techniques often rely on generic image priors, such as the sparsity of salient features including image gradients and edges. However, these priors help recover only part of the frequency spectrum, such as the frequencies near the high end. To this end, we pose the following specific questions: (i) Does any image class information offer an advantage over existing generic priors for image quality restoration? (ii) If a class-specific prior exists, how should it be encoded into a deblurring framework to recover attenuated image frequencies? In this work, we devise a class-specific prior based on band-pass filter responses and incorporate it into a deblurring strategy. More specifically, we show that the subspace of band-pass filtered images and their intensity distributions serve as useful priors for recovering image frequencies that are difficult to recover with generic image priors. We demonstrate that our image deblurring framework, when equipped with the above priors, significantly outperforms many state-of-the-art methods that use generic image priors or class-specific exemplars.
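
As a simple example of the band-pass filter responses such a prior could be built from, a difference-of-Gaussians filter is sketched below; the paper's actual filter bank is not specified here, and the sigmas are arbitrary.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def band_pass_response(img, sigma_low=1.0, sigma_high=3.0):
    """Difference-of-Gaussians band-pass response: keeps frequencies between
    the two (arbitrary) scales. img: 2-D grayscale array."""
    return gaussian_filter(img, sigma_low) - gaussian_filter(img, sigma_high)
```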

11.
Article in English | MEDLINE | ID: mdl-30072324

ABSTRACT

Prevalent techniques in zero-shot learning do not generalize well to other related problem scenarios. Here, we present a unified approach for conventional zero-shot, generalized zero-shot, and few-shot learning problems. Our approach is based on a novel Class Adapting Principal Directions (CAPD) concept that allows multiple embeddings of image features into a semantic space. Given an image, our method produces one principal direction for each seen class. It then learns how to combine these directions to obtain the principal direction for each unseen class, such that the CAPD of the test image is aligned with the semantic embedding of the true class and opposite to the other classes. This allows efficient, class-adaptive information transfer from seen to unseen classes. In addition, we propose an automatic process for selecting the most useful seen classes for each unseen class to achieve robustness in zero-shot learning. Our method can update the unseen-class CAPDs by taking advantage of a few unseen images, allowing it to work in a few-shot learning scenario. Furthermore, it can generalize the seen-class CAPDs by estimating seen-unseen diversity, which significantly improves the performance of generalized zero-shot learning. Our extensive evaluations demonstrate that the proposed approach consistently achieves superior performance in zero-shot, generalized zero-shot, and few/one-shot learning problems.

12.
IEEE Trans Neural Netw Learn Syst ; 29(9): 4339-4346, 2018 Sep.
Article in English | MEDLINE | ID: mdl-29990173

ABSTRACT

Despite its attractive properties, the performance of the recently introduced Keep It Simple and Straightforward MEtric learning (KISSME) method is greatly dependent on principal component analysis as a preprocessing step. This dependence can lead to difficulties, e.g., when the dimensionality is not meticulously set. To address this issue, we devise a unified formulation for joint dimensionality reduction and metric learning based on the KISSME algorithm. Our joint formulation is expressed as an optimization problem on the Grassmann manifold, and hence enjoys the properties of Riemannian optimization techniques. Following the success of deep learning in recent years, we also devise end-to-end learning of a generic deep network for metric learning using our derivation.
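
The core KISSME computation that the joint formulation builds on is compact enough to sketch: estimate covariances of pairwise difference vectors for similar and dissimilar pairs, subtract their inverses, and re-project onto the positive semidefinite cone. The Grassmann-manifold dimensionality reduction of the paper is not reproduced here.

```python
import numpy as np

def kissme_metric(diff_sim, diff_dis):
    """diff_sim / diff_dis: (n_pairs, d) arrays of difference vectors x_i - x_j
    for similar and dissimilar pairs. Returns the Mahalanobis matrix M used in
    d(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)."""
    cov_s = diff_sim.T @ diff_sim / len(diff_sim)
    cov_d = diff_dis.T @ diff_dis / len(diff_dis)
    M = np.linalg.inv(cov_s) - np.linalg.inv(cov_d)
    # KISSME re-projects M onto the PSD cone so it induces a valid metric
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T
```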

13.
IEEE Trans Neural Netw Learn Syst ; 29(10): 4769-4781, 2018 Oct.
Article in English | MEDLINE | ID: mdl-29990266

ABSTRACT

We propose a method that obtains a discriminative visual dictionary and a nonlinear classifier for visual tracking tasks in a sparse coding manner, based on a globally linear approximation of nonlinear learning theory. Traditional discriminative tracking methods based on sparse representation learn a dictionary in an unsupervised way and then train a classifier; by treating dictionary learning and classifier learning separately, they may fail to generate models for targets that are both descriptive and discriminative. In contrast, the proposed tracking approach constructs a dictionary that fully reflects the intrinsic manifold structure of the visual data and introduces more discriminative ability in a unified learning framework. Finally, we introduce an iterative optimization approach that computes the optimal dictionary, the associated sparse coding, and a classifier. Experiments on two benchmarks show that our tracker achieves better performance than several popular tracking algorithms.

14.
Article in English | MEDLINE | ID: mdl-29993770

ABSTRACT

We introduce a semi-supervised video segmentation approach based on an efficient video representation called the "super-trajectory". A super-trajectory corresponds to a group of compact point trajectories that exhibit consistent motion patterns, similar appearance, and close spatiotemporal relationships. We generate the compact trajectories using a probabilistic model, which enables effective handling of occlusions and drift. To reliably group point trajectories, we adopt a modified version of the density-peaks-based clustering algorithm that captures rich spatiotemporal relations among trajectories during clustering. We incorporate two intuitive mechanisms, reverse-tracking and object re-occurrence, for robustness and to boost performance. Building on the proposed video representation, our segmentation method is discriminative enough to accurately propagate the initial annotations in the first frame onto the remaining frames. Our extensive experimental analyses on three challenging benchmarks demonstrate that, given the annotation in the first frame, our method can extract the target objects from complex backgrounds, and even re-identify them after prolonged occlusions, producing high-quality video object segments.
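
The underlying density-peaks procedure (Rodriguez and Laio) that the paper modifies can be sketched in a few lines: each point gets a local density and a distance to the nearest higher-density point, and cluster centers are the points where both are large. The trajectory-specific relations of the paper are not modeled here.

```python
import numpy as np

def density_peaks(D, dc):
    """D: (n, n) pairwise distance matrix; dc: cutoff distance.
    Returns local density rho and delta, the distance to the nearest point of
    higher density; cluster centers have both values large."""
    n = len(D)
    rho = (D < dc).sum(axis=1) - 1            # neighbors within dc (minus self)
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]    # points of strictly higher density
        delta[i] = D[i, higher].min() if higher.size else D[i].max()
    return rho, delta
```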

15.
IEEE Trans Neural Netw Learn Syst ; 29(11): 5701-5712, 2018 Nov.
Article in English | MEDLINE | ID: mdl-29994290

ABSTRACT

Core to many learning pipelines is visual recognition, such as image and video classification. In such applications, a compact yet rich and informative representation plays a pivotal role. An underlying assumption in traditional coding schemes [e.g., sparse coding (SC)] is that the data geometrically comply with Euclidean space; in other words, the data are presented to the algorithm in vector form and the Euclidean axioms are fulfilled. As a large number of recent studies have shown, this is restrictive in machine learning, computer vision, and signal processing. This paper takes a further step and provides a comprehensive mathematical framework for coding in curved, non-Euclidean spaces, i.e., Riemannian manifolds. To this end, we start with the simplest form of coding, namely bag of words. Then, inspired by the success of the vector of locally aggregated descriptors in addressing computer vision problems, we introduce its Riemannian extensions. Finally, we study Riemannian forms of SC, locality-constrained linear coding, and collaborative coding. Through rigorous tests, we demonstrate the superior performance of our Riemannian coding schemes against state-of-the-art methods on several visual classification tasks, including head pose classification, video-based face recognition, and dynamic scene recognition.
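
A sketch of the simplest scheme in this progression, under the common log-Euclidean simplification: map each SPD descriptor to a tangent-space vector with the matrix logarithm, then quantize against a pre-fitted codebook to form a bag-of-words histogram. The paper's intrinsic Riemannian treatment is richer than this.

```python
import numpy as np
from scipy.linalg import logm
from sklearn.cluster import KMeans

def vec_logm(S):
    """Map an SPD matrix to a tangent-space vector via the matrix logarithm."""
    L = np.real(logm(S))
    return L[np.triu_indices(L.shape[0])]     # upper triangle as a flat vector

def encode_bow(spd_descriptors, codebook):
    """codebook: a KMeans model fitted beforehand on vec_logm() vectors of
    training descriptors. Returns a normalized bag-of-words histogram."""
    vecs = np.array([vec_logm(S) for S in spd_descriptors])
    words = codebook.predict(vecs)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()
```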

16.
IEEE Trans Image Process ; 27(6): 2747-2761, 2018 Jun.
Article in English | MEDLINE | ID: mdl-29553927

ABSTRACT

We tackle the challenge of constructing 64 pixels for each individual pixel of a thumbnail face image. We show that such an aggressive super-resolution objective can be attained by taking advantage of the global context and making the best use of the prior information portrayed by the image class. Our input image is so small that it can be considered as a patch of itself, so conventional patch-matching-based super-resolution solutions are unsuitable. To enhance the resolution while enforcing the global context, we incorporate a pixel-wise appearance similarity objective into a deconvolutional neural network, which allows efficient learning of mappings between low-resolution input images and their high-resolution counterparts in the training dataset. Furthermore, the deconvolutional network blends the learned high-resolution constituent parts in an authentic manner, where the face structure is naturally imposed and the global context is preserved. To account for possible artifacts in the upsampled feature maps, we employ a sub-network composed of additional convolutional layers. During training, we use only roughly aligned images (eye locations only), yet demonstrate that our network has the capacity to super-resolve face images regardless of pose and facial expression variations. This significantly reduces the requirement for precise face alignment in the dataset. Owing to the network topology we apply, our method is robust to translational misalignment; in addition, with data augmentation, it is able to upsample rotationally unaligned faces. Our extensive experimental analysis shows that our method achieves more appealing and superior results than the state of the art.

17.
IEEE Trans Pattern Anal Mach Intell ; 40(1): 20-33, 2018 Jan.
Article in English | MEDLINE | ID: mdl-28166489

ABSTRACT

Video saliency, which aims to estimate a single dominant object in a sequence, offers strong object-level cues for unsupervised video object segmentation. In this paper, we present a geodesic-distance-based technique that provides reliable and temporally consistent saliency measurements over superpixels as a prior for pixel-wise labeling. Using undirected intra-frame and inter-frame graphs constructed from spatiotemporal edges based on appearance and motion, together with a skeleton abstraction step to further enhance the saliency estimates, our method formulates the pixel-wise segmentation task as an energy minimization problem over a function consisting of unary terms (global foreground and background models, dynamic location models) and pairwise terms (label smoothness potentials). We perform extensive quantitative and qualitative experiments on benchmark datasets. Our method achieves superior performance in comparison with the current state of the art in terms of both accuracy and speed.

18.
IEEE Trans Image Process ; 26(12): 5645-5655, 2017 Dec.
Article in English | MEDLINE | ID: mdl-28858791

ABSTRACT

Conventional video segmentation approaches rely heavily on appearance models. Such methods often use appearance descriptors that have limited discriminative power under complex scenarios. To improve segmentation performance, this paper presents a pyramid-histogram-based confidence map that incorporates structure information into the appearance statistics and combines it with geodesic-distance-based dynamic models. It then employs an efficient measure of uncertainty propagation using local classifiers to determine the image regions where the object labels might be ambiguous, and obtains the final foreground cutout by refining these uncertain regions. Additionally, to reduce manual labeling, our method determines the frames to be labeled by the human operator in a principled manner, which further boosts segmentation performance while minimizing labeling effort. Our extensive experimental analyses on two large benchmarks demonstrate that our solution achieves superior performance, favorable computational efficiency, and reduced manual labeling in comparison with the state of the art.

19.
IEEE Trans Image Process ; 26(12): 5800-5810, 2017 Dec.
Article in English | MEDLINE | ID: mdl-28858801

ABSTRACT

In this paper, we present a novel part-based visual tracking method from the perspective of probability sampling. Specifically, we represent the target by a part space with two online learned probabilities to capture the structure of the target. The proposal distribution memorizes the historical performance of different parts, and it is used for the first round of part selection. The acceptance probability validates the specific tracking stability of each part in a frame, and it determines whether to accept its vote or to reject it. By doing this, we transform the complex online part selection problem into a probability learning one, which is easier to tackle. The observation model of each part is constructed by an improved supervised descent method and is learned in an incremental manner. Experimental results on two benchmarks demonstrate the competitive performance of our tracker against state-of-the-art methods.

20.
IEEE Trans Image Process ; 26(12): 5922-5935, 2017 Dec.
Article in English | MEDLINE | ID: mdl-28858805

ABSTRACT

In recent years, sparse representation-based classification (SRC) has been one of the most successful methods, showing impressive performance in various classification tasks. However, when the training data follow a different distribution than the testing data, the learned sparse representation may not be optimal and the performance of SRC degrades significantly. To address this problem, we propose an optimal couple projections for domain-adaptive SRC (OCPD-SRC) method, in which discriminative features of the data in the two domains are learned simultaneously with a dictionary that can succinctly represent the training and testing data in the projected space. OCPD-SRC is designed around the decision rule of SRC, with the objective of learning coupled projection matrices and a common discriminative dictionary such that, in the projected low-dimensional space, the between-class sparse reconstruction residuals of data from both domains are maximized and the within-class sparse reconstruction residuals are minimized. The resulting representations thus fit SRC well while having better discriminative ability. In addition, our method can easily be extended to multiple domains and can be kernelized to deal with nonlinear data structure. The optimal solution for the proposed method can be obtained efficiently via alternating optimization. Extensive experimental results on a series of benchmark databases show that our method is better than or comparable to many state-of-the-art methods.
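
The plain SRC decision rule that OCPD-SRC builds on can be sketched as follows: sparsely code a test sample over a dictionary of training atoms, then assign the class whose atoms yield the smallest reconstruction residual. Here scikit-learn's orthogonal matching pursuit stands in for the sparse coder, and the sparsity level is an assumption.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def src_classify(y, D, labels, n_nonzero=30):
    """y: (d,) test sample; D: (d, n_atoms) dictionary of training samples with
    unit-norm columns; labels: (n_atoms,) class label of each atom."""
    code = orthogonal_mp(D, y, n_nonzero_coefs=min(n_nonzero, D.shape[1]))
    best, best_res = None, np.inf
    for c in np.unique(labels):
        part = np.where(labels == c, code, 0.0)  # keep only class-c coefficients
        res = np.linalg.norm(y - D @ part)       # class-wise reconstruction residual
        if res < best_res:
            best, best_res = c, res
    return best
```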
