Results 1 - 20 of 27
1.
PLoS Comput Biol ; 16(4): e1007698, 2020 04.
Article in English | MEDLINE | ID: mdl-32271746

ABSTRACT

Humans are able to track multiple objects at any given time in their daily activities; for example, we can drive a car while monitoring obstacles, pedestrians, and other vehicles. Several past studies have examined how humans track targets simultaneously and what underlying behavioral and neural mechanisms they use. At the same time, computer-vision researchers have proposed different algorithms to track multiple targets automatically. These algorithms are useful for video surveillance, team-sport analysis, video analysis, video summarization, and human-computer interaction. Although there are several efficient biologically inspired algorithms in artificial intelligence, the human multiple-target tracking (MTT) ability is rarely imitated in computer-vision algorithms. In this paper, we review MTT studies in neuroscience and biologically inspired MTT methods in computer vision and discuss the ways in which they can be seen as complementary.


Subject(s)
Artificial Intelligence; Memory/physiology; Vision, Ocular/physiology; Algorithms; Animals; Brain/physiology; Cognition; Humans; Image Processing, Computer-Assisted/methods; Motion; Neurosciences; Video Recording/methods
2.
J Vis ; 16(14): 18, 2016 11 01.
Article in English | MEDLINE | ID: mdl-27903005

ABSTRACT

Several structural scene cues, such as gist, layout, horizontal line, openness, and depth, have been shown to guide scene perception (e.g., Oliva & Torralba, 2001; Ross & Oliva, 2009). Here, to investigate whether the vanishing point (VP) plays a significant role in gaze guidance, we ran two experiments. In the first, we recorded fixations of 10 observers (six male, four female; mean age 22, SD = 0.84) freely viewing 532 images, of which 319 contained a VP (shuffled presentation; each image shown for 4 s). We found that the average number of fixations in a local region (80 × 80 pixels) centered on the VP is significantly higher than the average number of fixations at random locations (t test; n = 319; p < 0.001). To address the confounding factor of saliency, we learned a combined model of bottom-up saliency and VP. The AUC (area under the curve) score of our model (0.85; SD = 0.01) is significantly higher than that of the base saliency model (e.g., 0.8 using the attention for information maximization (AIM) model by Bruce & Tsotsos, 2005; t test, p = 3.14e-16) and the VP-only model (0.64; t test, p < 0.001). In the second experiment, we asked 14 subjects (10 male, four female; mean age 23.07, SD = 1.26) to search for a target character (T or L) placed randomly on a 3 × 3 imaginary grid overlaid on top of an image. Subjects reported their answers by pressing one of two keys. Stimuli consisted of 270 color images (180 with a single VP, 90 without). The target appeared with equal probability inside each cell (15 times L, 15 times T). We found that subjects were significantly faster (and more accurate) when the target appeared inside the cell containing the VP compared with cells without the VP (median across 14 subjects: 1.34 s vs. 1.96 s; Wilcoxon rank-sum test; p = 0.0014). These findings support the hypothesis that the vanishing point, like faces, text (Cerf, Frady, & Koch, 2009), and gaze direction (Borji, Parks, & Itti, 2014), guides attention in free-viewing and visual search tasks.
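
As a rough illustration of the kind of combined model described above, the sketch below mixes a normalized bottom-up saliency map with a Gaussian channel centered on the VP and scores the result with AUC against binary fixation locations. The Gaussian width, the mixing weight, and the toy data are assumptions for illustration only, not the paper's implementation.

```python
# Illustrative sketch only (not the paper's code): combine a bottom-up saliency map
# with a vanishing-point (VP) channel and score fixation prediction with AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def vp_channel(shape, vp_xy, sigma=40.0):
    """Gaussian bump centered on the VP; sigma is an assumed parameter."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - vp_xy[0]) ** 2 + (ys - vp_xy[1]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def combined_map(saliency, vp_xy, weight=0.5):
    """Convex combination of normalized saliency and the VP channel."""
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    return (1 - weight) * s + weight * vp_channel(saliency.shape, vp_xy)

def auc_score(pred_map, fixation_mask):
    """AUC of the predicted map against binary fixation locations."""
    return roc_auc_score(fixation_mask.ravel().astype(int), pred_map.ravel())

# toy usage with random data
rng = np.random.default_rng(0)
sal = rng.random((480, 640))
fix = np.zeros((480, 640), dtype=bool)
fix[rng.integers(0, 480, 50), rng.integers(0, 640, 50)] = True
print(auc_score(combined_map(sal, vp_xy=(320, 200)), fix))
```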


Subject(s)
Eye Movements/physiology; Fixation, Ocular/physiology; Pattern Recognition, Visual/physiology; Visual Perception/physiology; Attention/physiology; Cues; Female; Humans; Male; Probability; Young Adult
3.
J Vis ; 14(3): 29, 2014 Mar 24.
Article in English | MEDLINE | ID: mdl-24665092

ABSTRACT

In a very influential yet anecdotal illustration, Yarbus suggested that human eye-movement patterns are modulated top down by different task demands. While the hypothesis that it is possible to decode the observer's task from eye movements has received some support (e.g., Henderson, Shinkareva, Wang, Luke, & Olejarczyk, 2013; Iqbal & Bailey, 2004), Greene, Liu, and Wolfe (2012) argued against it by reporting a failure. In this study, we perform a more systematic investigation of this problem, probing a larger number of experimental factors than previously explored. Our main goal is to determine the informativeness of eye movements for task and mental-state decoding. We perform two experiments. In the first experiment, we reanalyze the data from a previous study by Greene et al. (2012) and, contrary to their conclusion, we report that it is possible to decode the observer's task from aggregate eye-movement features slightly but significantly above chance, using a boosting classifier (34.12% correct vs. 25% chance level; binomial test, p = 1.0722e-04). In the second experiment, we repeat and extend Yarbus's original experiment by collecting eye movements of 21 observers viewing 15 natural scenes (including Yarbus's scene) under Yarbus's seven questions. We show that task decoding is possible, also moderately but significantly above chance (24.21% vs. 14.29% chance level; binomial test, p = 2.4535e-06). We thus conclude that Yarbus's idea is supported by our data and continues to be an inspiration for future computational and experimental eye-movement research. From a broader perspective, we discuss techniques, features, limitations, societal and technological impacts, and future directions in task decoding from eye movements.
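
The decoding setup lends itself to a compact sketch: aggregate eye-movement features per trial, a boosting classifier evaluated with cross-validation, and a binomial test of accuracy against the chance level. Everything below (the feature set, the synthetic data, and the choice of GradientBoostingClassifier) is a hedged stand-in for the study's pipeline, not its actual code.

```python
# Illustrative sketch only: decode the viewing task from aggregate eye-movement
# features with a boosting classifier, then test accuracy against chance.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n_trials, n_tasks = 280, 7            # e.g., 7 Yarbus-style questions
# aggregate features per trial (hypothetical): fixation count, mean fixation
# duration, mean saccade amplitude, etc.
X = rng.normal(size=(n_trials, 6))
y = rng.integers(0, n_tasks, size=n_trials)

clf = GradientBoostingClassifier(random_state=0)
pred = cross_val_predict(clf, X, y, cv=5)
n_correct = int((pred == y).sum())
accuracy = n_correct / n_trials
chance = 1.0 / n_tasks

# one-sided binomial test: is accuracy above the 1/7 chance level?
result = binomtest(n_correct, n_trials, chance, alternative="greater")
print(f"accuracy={accuracy:.3f}, chance={chance:.3f}, p={result.pvalue:.3g}")
```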


Subject(s)
Attention; Eye Movements/physiology; Pattern Recognition, Visual/physiology; Cognition/physiology; Female; Humans; Male; Psychomotor Performance/physiology; Young Adult
4.
J Vis ; 14(13): 3, 2014 Nov 04.
Article in English | MEDLINE | ID: mdl-25371549

ABSTRACT

Gaze direction provides an important and ubiquitous communication channel in the daily behavior and social interaction of humans and some animals. While several studies have addressed gaze direction in synthesized simple scenes, few have examined how it can bias observer attention and how it might interact with early saliency during free viewing of natural and realistic scenes. To causally assess the effects of actor gaze direction, Experiment 1 used a controlled, staged setting in which an actor was asked to look at two different objects in turn, yielding two images that differed only in the actor's gaze direction. Over all scenes, the median probability of following an actor's gaze direction was higher than the median probability of looking toward the single most salient location, and higher than chance. Experiment 2 confirmed these findings over a larger set of unconstrained scenes collected from the Web and containing people looking at objects and/or other people. To further compare the strength of saliency versus gaze direction cues, we computed gaze maps by drawing a cone in the direction of gaze of the actors present in the images. Gaze maps predicted observers' fixation locations significantly above chance, although below saliency. Finally, to gauge the relative importance of actor face and eye directions in guiding observers' fixations, in Experiment 3, observers were asked to guess the gaze direction from only an actor's face region (with the rest of the scene masked), in two conditions: actor eyes visible or masked. The median probability of guessing the true gaze direction within ±9° was significantly higher when the eyes were visible, suggesting that the eyes contribute significantly to gaze estimation, in addition to the face region. Our results highlight that gaze direction is a strong attentional cue in guiding eye movements, complementing low-level saliency cues, and is derived from both the face and the eyes of actors in the scene. Gaze direction should therefore be considered in constructing more predictive visual attention models in the future.
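
A gaze map of the kind described above can be sketched as a simple cone drawn from the actor's head position along the gaze direction; the geometry below (cone half-width, angle convention) is an assumption for illustration rather than the paper's implementation.

```python
# Rough sketch (assumed geometry): build a "gaze map" as a cone from an actor's
# head position along the gaze direction, in image coordinates (y grows downward).
import numpy as np

def gaze_cone_map(shape, head_xy, gaze_angle_rad, half_width_rad=np.deg2rad(15)):
    """Binary cone mask: pixels within +/- half_width of the gaze direction."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - head_xy[0], ys - head_xy[1]
    pixel_angle = np.arctan2(dy, dx)
    # smallest angular difference to the gaze direction
    diff = np.abs((pixel_angle - gaze_angle_rad + np.pi) % (2 * np.pi) - np.pi)
    cone = (diff <= half_width_rad).astype(float)
    cone[int(head_xy[1]), int(head_xy[0])] = 1.0   # include the head point itself
    return cone

# example: actor at (100, 240) looking 30 degrees downward to the right
gmap = gaze_cone_map((480, 640), head_xy=(100, 240), gaze_angle_rad=np.deg2rad(30))
print(gmap.sum(), "pixels inside the gaze cone")
```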


Subject(s)
Attention/physiology; Eye Movements/physiology; Fixation, Ocular/physiology; Pattern Recognition, Visual/physiology; Cues; Face; Female; Humans; Male; Young Adult
5.
J Vis ; 13(10): 18, 2013 Aug 29.
Article in English | MEDLINE | ID: mdl-23988384

ABSTRACT

Einhäuser, Spain, and Perona (2008) explored an alternative hypothesis to saliency maps (i.e., spatial image outliers) and claimed that "objects predict fixations better than early saliency." To test their hypothesis, they measured eye movements of human observers while they inspected 93 photographs of common natural scenes (Uncommon Places dataset by Shore, Tillman, & Schmidt-Wulen, 2004; Supplement Figure S4). Subjects were asked to observe an image and, immediately afterwards, to name the objects they saw (remembered). Einhäuser et al. showed that a map made of manually drawn object regions, with each object weighted by its recall frequency, predicts fixations in individual images better than early saliency. Due to the important implications of this hypothesis, we investigate it further. The core of our analysis is explained here; please refer to the Supplement for details.
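
The object-based map being tested can be sketched directly from its description: sum the manually drawn object masks, each weighted by its recall frequency. The data structures below are assumed placeholders, not the original annotations.

```python
# Sketch of the kind of map being compared (assumed data structures): each manually
# annotated object mask is weighted by how often observers recalled that object.
import numpy as np

def object_recall_map(shape, object_masks, recall_counts, n_observers):
    """Sum of binary object masks, each weighted by its recall frequency."""
    out = np.zeros(shape, dtype=float)
    for mask, count in zip(object_masks, recall_counts):
        out += mask.astype(float) * (count / n_observers)
    if out.max() > 0:
        out /= out.max()
    return out

# toy example: two rectangular "objects", recalled by 8/10 and 3/10 observers
h, w = 240, 320
m1 = np.zeros((h, w), bool); m1[50:120, 60:140] = True
m2 = np.zeros((h, w), bool); m2[150:200, 200:280] = True
omap = object_recall_map((h, w), [m1, m2], recall_counts=[8, 3], n_observers=10)
print(omap.max(), omap[60, 70], omap[160, 210])
```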


Subject(s)
Fixation, Ocular/physiology; Models, Psychological; Pattern Recognition, Visual/physiology; Female; Humans; Male
6.
IEEE Trans Pattern Anal Mach Intell ; 44(11): 8006-8021, 2022 11.
Article in English | MEDLINE | ID: mdl-34437058

ABSTRACT

CNN-based salient object detection (SOD) methods achieve impressive performance. However, the way semantic information is encoded in them, and whether they are category-agnostic, is less explored. One major obstacle in studying these questions is the fact that SOD models are built on top of ImageNet pre-trained backbones, which may cause information leakage and feature redundancy. To remedy this, here we first propose an extremely lightweight holistic model (dubbed CSNet) tied to the SOD task that can be freed from classification backbones and trained from scratch, and then employ it to study the semantics of SOD models. With the holistic network and representation-redundancy reduction by a novel dynamic weight decay scheme, our model has only 100K parameters, ~0.2% of the parameters of large models, and performs on par with the state of the art (SOTA) on popular SOD benchmarks. Using CSNet, we find that a) SOD and classification methods use different mechanisms, b) SOD models are category insensitive, c) ImageNet pre-training is not necessary for SOD training, and d) SOD models require far fewer parameters than classification models. The source code is publicly available at https://mmcheng.net/sod100k/.


Subject(s)
Neural Networks, Computer; Semantics; Algorithms
7.
IEEE Trans Neural Netw Learn Syst ; 33(3): 1051-1065, 2022 03.
Article in English | MEDLINE | ID: mdl-33296311

ABSTRACT

Deep neural networks are vulnerable to adversarial attacks. More importantly, some adversarial examples crafted against an ensemble of source models transfer to other target models and thus pose a security threat to black-box applications (where attackers have no access to the target models). Current transfer-based ensemble attacks, however, only consider a limited number of source models to craft an adversarial example and thus obtain poor transferability. Besides, recent query-based black-box attacks, which require numerous queries to the target model, not only arouse suspicion at the target model but also incur expensive query costs. In this article, we propose a novel transfer-based black-box attack, dubbed serial-minigroup-ensemble-attack (SMGEA). Concretely, SMGEA first divides a large number of pretrained white-box source models into several "minigroups." For each minigroup, we design three new ensemble strategies to improve the intragroup transferability. Moreover, we propose a new algorithm that recursively accumulates the "long-term" gradient memories of the previous minigroup into the subsequent minigroup. This way, the learned adversarial information can be preserved and the intergroup transferability can be improved. Experiments indicate that SMGEA not only achieves state-of-the-art black-box attack ability over several data sets but also deceives two online black-box saliency prediction systems in the real world, i.e., DeepGaze-II (https://deepgaze.bethgelab.org/) and SALICON (http://salicon.net/demo/). Finally, we contribute a new code repository to promote research on adversarial attack and defense over ubiquitous pixel-to-pixel computer vision tasks. We share our code, together with the pretrained substitute model zoo, at https://github.com/CZHQuality/AAA-Pix2pix.
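
A heavily simplified sketch of the serial-minigroup idea is given below: iterate over minigroups of white-box source models, accumulate a long-term gradient memory across groups, and take signed steps within an L-infinity budget. The models, loss function, step sizes, and memory update are placeholders chosen for illustration; this is not the released SMGEA code.

```python
# Simplified sketch (not SMGEA itself): serial minigroups of source models with an
# accumulated "long-term" gradient memory carried from group to group.
import torch

def minigroup_attack(x, y, minigroups, loss_fn, eps=8/255, alpha=1/255,
                     steps_per_group=10, decay=1.0):
    """x: input batch in [0,1]; minigroups: list of lists of models; returns x_adv."""
    x_adv = x.clone().detach()
    memory = torch.zeros_like(x)              # gradient memory shared across groups
    for group in minigroups:
        for _ in range(steps_per_group):
            x_adv.requires_grad_(True)
            # simple intra-group ensemble: average the loss over the group's models
            loss = sum(loss_fn(m(x_adv), y) for m in group) / len(group)
            grad = torch.autograd.grad(loss, x_adv)[0]
            # momentum-style accumulation of normalized gradients
            memory = decay * memory + grad / grad.abs().mean().clamp_min(1e-12)
            x_adv = x_adv.detach() + alpha * memory.sign()
            x_adv = torch.clamp(torch.min(torch.max(x_adv, x - eps), x + eps), 0, 1)
    return x_adv.detach()
```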


Subject(s)
Algorithms; Neural Networks, Computer; Learning; Memory, Long-Term
8.
IEEE Trans Pattern Anal Mach Intell ; 43(2): 679-700, 2021 02.
Article in English | MEDLINE | ID: mdl-31425064

ABSTRACT

Visual saliency models have enjoyed a big leap in performance in recent years, thanks to advances in deep learning and large-scale annotated data. Despite enormous effort and huge breakthroughs, however, models still fall short of reaching human-level accuracy. In this work, I explore the landscape of the field with an emphasis on new deep saliency models, benchmarks, and datasets. A large number of image and video saliency models are reviewed and compared over two image benchmarks and two large-scale video datasets. Further, I identify factors that contribute to the gap between models and humans and discuss the remaining issues that need to be addressed to build the next generation of more powerful saliency models. Some specific questions that are addressed include: in what ways current models fail, how to remedy them, what can be learned from cognitive studies of attention, how explicit saliency judgments relate to fixations, how to conduct fair model comparison, and what the emerging applications of saliency models are.

9.
IEEE Trans Pattern Anal Mach Intell ; 43(1): 220-237, 2021 01.
Article in English | MEDLINE | ID: mdl-31247542

ABSTRACT

Predicting where people look in static scenes, a.k.a. visual saliency, has received significant research interest recently. However, relatively less effort has been spent on understanding and modeling visual attention over dynamic scenes. This work makes three contributions to video saliency research. First, we introduce a new benchmark, called DHF1K (Dynamic Human Fixation 1K), for predicting fixations during dynamic scene free-viewing, addressing a long-standing need in this field. DHF1K consists of 1K high-quality, carefully selected video sequences annotated by 17 observers using an eye tracker. The videos span a wide range of scenes, motions, object types, and backgrounds. Second, we propose a novel video saliency model, called ACLNet (Attentive CNN-LSTM Network), that augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, thus allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. Third, we perform an extensive evaluation of state-of-the-art saliency models on three datasets: DHF1K, Hollywood-2, and UCF Sports. An attribute-based analysis of previous saliency models and cross-dataset generalization are also presented. Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that ACLNet outperforms other contenders and has a fast processing speed (40 fps using a single GPU). Our code and all the results are available at https://github.com/wenguanwang/DHF1K.
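
The attentive CNN-LSTM design can be sketched compactly: per-frame CNN features, a small attention head that predicts a static saliency map and modulates the features, and a ConvLSTM that carries temporal context. The encoder, layer sizes, and readout below are illustrative assumptions, not the released ACLNet.

```python
# Compact sketch of an attention-augmented ConvLSTM in the spirit described above
# (simplified; the frame encoder and layer sizes are placeholders).
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

class AttentiveCNNLSTM(nn.Module):
    def __init__(self, feat_ch=64, hid_ch=64):
        super().__init__()
        self.encoder = nn.Sequential(               # stand-in frame encoder
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU())
        self.attention = nn.Conv2d(feat_ch, 1, 1)   # static-attention head
        self.cell = ConvLSTMCell(feat_ch, hid_ch)
        self.readout = nn.Conv2d(hid_ch, 1, 1)      # per-frame saliency map

    def forward(self, clip):                        # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats0 = self.encoder(clip[:, 0])
        state = (torch.zeros(b, self.cell.hid_ch, *feats0.shape[-2:], device=clip.device),
                 torch.zeros(b, self.cell.hid_ch, *feats0.shape[-2:], device=clip.device))
        maps = []
        for i in range(t):
            feats = self.encoder(clip[:, i])
            att = torch.sigmoid(self.attention(feats))
            state = self.cell(feats * (1 + att), state)   # attention-modulated features
            maps.append(torch.sigmoid(self.readout(state[0])))
        return torch.stack(maps, dim=1)             # (B, T, 1, H/4, W/4)

out = AttentiveCNNLSTM()(torch.rand(1, 4, 3, 64, 64))
print(out.shape)  # torch.Size([1, 4, 1, 16, 16])
```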


Subject(s)
Deep Learning; Algorithms; Humans
10.
IEEE Trans Image Process ; 30: 8727-8742, 2021.
Article in English | MEDLINE | ID: mdl-34613915

ABSTRACT

Multi-level feature fusion is a fundamental topic in computer vision. It has been exploited to detect, segment and classify objects at various scales. When multi-level features meet multi-modal cues, the optimal feature aggregation and multi-modal learning strategy becomes an open question. In this paper, we leverage the inherent multi-modal and multi-level nature of RGB-D salient object detection to devise a novel Bifurcated Backbone Strategy Network (BBS-Net). Our architecture is simple, efficient, and backbone-independent. In particular, first, we propose to regroup the multi-level features into teacher and student features using a bifurcated backbone strategy (BBS). Second, we introduce a depth-enhanced module (DEM) to excavate informative depth cues from the channel and spatial views. Then, RGB and depth modalities are fused in a complementary way. Extensive experiments show that BBS-Net significantly outperforms 18 state-of-the-art (SOTA) models on eight challenging datasets under five evaluation measures, demonstrating the superiority of our approach (~4% improvement in S-measure vs. the top-ranked model, DMRA). In addition, we provide a comprehensive analysis of the generalization ability of different RGB-D datasets and provide a powerful training set for future research. The complete algorithm, benchmark results, and post-processing toolbox are publicly available at https://github.com/zyjwuyan/BBS-Net.
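
One way to picture a depth-enhanced module is channel attention followed by spatial attention over the depth features before fusing them with RGB features; the sketch below follows that reading with illustrative sizes and is not the BBS-Net release.

```python
# Loose sketch of a depth-enhancement idea (channel then spatial attention over
# depth features); module names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class DepthEnhancement(nn.Module):
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(
            nn.Conv2d(ch, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, depth_feat, rgb_feat):
        d = depth_feat * self.channel_att(depth_feat)   # emphasize informative channels
        d = d * self.spatial_att(d)                     # emphasize informative locations
        return rgb_feat + d                             # complementary RGB-D fusion

x_rgb = torch.rand(2, 64, 40, 40)
x_depth = torch.rand(2, 64, 40, 40)
print(DepthEnhancement(64)(x_depth, x_rgb).shape)       # torch.Size([2, 64, 40, 40])
```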

11.
IEEE Trans Image Process ; 30: 1973-1988, 2021.
Article in English | MEDLINE | ID: mdl-33444138

ABSTRACT

Saliency detection is an effective front-end process for many security-related tasks, e.g., autonomous driving and tracking. Adversarial attack serves as an efficient surrogate to evaluate the robustness of deep saliency models before they are deployed in the real world. However, most current adversarial attacks exploit the gradients spanning the entire image space to craft adversarial examples, ignoring the fact that natural images are high-dimensional and spatially over-redundant, thus causing expensive attack cost and poor perceptibility. To circumvent these issues, this paper builds an efficient bridge between the accessible partially-white-box source models and the unknown black-box target models. The proposed method includes two steps: 1) We design a new partially-white-box attack, which defines the cost function in the compact hidden space to punish a fraction of feature activations corresponding to the salient regions, instead of punishing every pixel spanning the entire dense output space. This partially-white-box attack reduces the redundancy of the adversarial perturbation. 2) We exploit the non-redundant perturbations from some source models as prior cues, and use an iterative zeroth-order optimizer to compute the directional derivatives along the non-redundant prior directions, in order to estimate the actual gradient of the black-box target model. The non-redundant priors boost the update of some "critical" pixels located at non-zero coordinates of the prior cues, while keeping other redundant pixels located at the zero coordinates unaffected. Our method achieves the best tradeoff between attack ability and perturbation redundancy. Finally, we conduct a comprehensive experiment to test the robustness of 18 state-of-the-art deep saliency models against 16 malicious attacks, under both white-box and black-box settings, which contributes a new robustness benchmark to the saliency community for the first time.
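
The zeroth-order step can be sketched as finite-difference directional derivatives along a few prior directions, accumulated into a gradient estimate. The black-box loss and the priors below are toy placeholders supplied by the user, not the paper's pipeline.

```python
# Sketch of zeroth-order gradient estimation along prior directions (finite
# differences); `black_box_loss` and the transfer priors are placeholders.
import numpy as np

def estimate_gradient(black_box_loss, x, prior_dirs, h=1e-3):
    """Estimate grad f(x) as a sum of directional derivatives along unit priors.

    prior_dirs: array of shape (k, *x.shape), e.g., non-redundant perturbations
    obtained from partially-white-box source models.
    """
    grad_est = np.zeros_like(x)
    f0 = black_box_loss(x)
    for d in prior_dirs:
        u = d / (np.linalg.norm(d) + 1e-12)             # unit prior direction
        dd = (black_box_loss(x + h * u) - f0) / h       # forward-difference derivative
        grad_est += dd * u                              # accumulate along that direction
    return grad_est

# toy usage: quadratic "loss", two random prior directions
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))
loss = lambda z: float((z ** 2).sum())
priors = rng.normal(size=(2, 8, 8))
print(estimate_gradient(loss, x0, priors).shape)
```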

12.
IEEE Trans Pattern Anal Mach Intell ; 42(8): 1913-1927, 2020 08.
Article in English | MEDLINE | ID: mdl-30892201

ABSTRACT

Previous research in visual saliency has been focused on two major types of models, namely fixation prediction and salient object detection. The relationship between the two, however, has been less explored. In this work, we propose to employ the former model type to identify salient objects. We build a novel Attentive Saliency Network (ASNet; available at https://github.com/wenguanwang/ASNet) that learns to detect salient objects from fixations. The fixation map, derived at the upper network layers, mimics human visual attention mechanisms and captures a high-level understanding of the scene from a global view. Salient object detection is then viewed as fine-grained object-level saliency segmentation and is progressively optimized with the guidance of the fixation map in a top-down manner. ASNet is based on a hierarchy of convLSTMs that offers an efficient recurrent mechanism to sequentially refine the saliency features over multiple steps. Several loss functions, derived from existing saliency evaluation metrics, are incorporated to further boost the performance. Extensive experiments on several challenging datasets show that our ASNet outperforms existing methods and is capable of generating accurate segmentation maps with the help of the computed fixation prior. Our work offers a deeper insight into the mechanisms of attention and narrows the gap between salient object detection and fixation prediction.

13.
Article in English | MEDLINE | ID: mdl-31905138

ABSTRACT

Deep convolutional neural networks (CNNs) have been successfully applied to a wide variety of problems in computer vision, including salient object detection. To accurately detect and segment salient objects, it is necessary to extract and combine high-level semantic features with low-level fine details simultaneously. This is challenging for CNNs because repeated subsampling operations such as pooling and strided convolution lead to a significant decrease in feature resolution, which results in the loss of spatial details and finer structures. Therefore, we propose augmenting feedforward neural networks with a multistage refinement mechanism. In the first stage, a master net is built to generate a coarse prediction map in which most detailed structures are missing. In the following stages, a refinement net with layerwise recurrent connections to the master net progressively combines local context information across stages to refine the preceding saliency maps in a stagewise manner. Furthermore, a pyramid pooling module and a channel attention module are applied to aggregate different-region-based global contexts. Extensive evaluations over six benchmark datasets show that the proposed method performs favorably against state-of-the-art approaches.
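
A pyramid pooling module of the kind mentioned above can be sketched as pooling at several bin sizes, projecting, upsampling, and concatenating with the input features; the bin sizes and channel widths below are illustrative assumptions, not the paper's configuration.

```python
# Minimal pyramid-pooling sketch: pool at several scales, project, upsample, and
# concatenate to aggregate region-based global context. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(ch, ch // len(bins), 1), nn.ReLU())
            for b in bins)
        self.fuse = nn.Conv2d(ch + ch // len(bins) * len(bins), ch, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        ctx = [F.interpolate(s(x), size=(h, w), mode="bilinear", align_corners=False)
               for s in self.stages]
        return self.fuse(torch.cat([x] + ctx, dim=1))

print(PyramidPooling(64)(torch.rand(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```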

14.
IEEE Trans Pattern Anal Mach Intell ; 41(6): 1353-1366, 2019 06.
Article in English | MEDLINE | ID: mdl-29994045

ABSTRACT

Thanks to the availability and increasing popularity of wearable devices such as GoPro cameras, smartphones, and smart glasses, we have access to a plethora of videos captured from a first-person perspective. Surveillance cameras and unmanned aerial vehicles (UAVs) also offer tremendous amounts of video data recorded from top and oblique viewpoints. Egocentric and surveillance vision have been studied extensively but separately in the computer vision community. The relationship between these two domains, however, remains unexplored. In this study, we make the first attempt in this direction by addressing two basic yet challenging questions. First, given a set of egocentric videos and a top-view surveillance video, does the top-view video contain all or some of the egocentric viewers? In other words, have these videos been shot in the same environment at the same time? Second, if so, can we identify the egocentric viewers in the top-view video? These problems can become extremely challenging when videos are not temporally aligned. Each view, egocentric or top, is modeled by a graph, and the assignment and time delays are computed iteratively using a spectral graph-matching framework. We evaluate our method in terms of ranking and assigning egocentric viewers to identities present in the top-view video over a dataset of 50 top-view and 188 egocentric videos captured under different conditions. We also evaluate the capability of our proposed approaches in terms of temporal alignment. The experiments and results demonstrate the capability of the proposed approaches in jointly addressing the temporal alignment and assignment tasks.
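
The assignment step can be sketched with standard spectral matching: take the principal eigenvector of a pairwise-consistency (affinity) matrix over candidate viewer-identity correspondences as soft confidences, then discretize under one-to-one constraints. This is a generic Leordeanu-Hebert-style sketch, not the paper's full iterative procedure with time-delay estimation.

```python
# Sketch of spectral matching for a viewer-to-identity assignment (generic
# relaxation + Hungarian discretization; not the paper's full method).
import numpy as np
from scipy.optimize import linear_sum_assignment

def spectral_assignment(affinity, n_ego, n_top):
    """affinity: (n_ego*n_top, n_ego*n_top) pairwise-consistency matrix over
    candidate (egocentric viewer, top-view identity) correspondences."""
    # principal eigenvector ~ soft confidence of each candidate correspondence
    vals, vecs = np.linalg.eigh(affinity)
    conf = np.abs(vecs[:, np.argmax(vals)]).reshape(n_ego, n_top)
    # discretize under one-to-one constraints with the Hungarian algorithm
    rows, cols = linear_sum_assignment(-conf)
    return list(zip(rows, cols)), conf

# toy example: 3 egocentric viewers, 4 people visible in the top view
rng = np.random.default_rng(0)
A = rng.random((12, 12)); A = (A + A.T) / 2          # symmetric toy affinity
pairs, conf = spectral_assignment(A, n_ego=3, n_top=4)
print(pairs)
```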

15.
Article in English | MEDLINE | ID: mdl-31613763

ABSTRACT

Data size is the bottleneck for developing deep saliency models, because collecting eye-movement data is very time-consuming and expensive. Most current studies on human attention and saliency modeling have used high-quality, stereotypical stimuli. In the real world, however, captured images undergo various types of transformations. Can we use these transformations to augment existing saliency datasets? Here, we first create a novel saliency dataset including fixations of 10 observers over 1900 images degraded by 19 types of transformations. Second, by analyzing eye movements, we find that observers look at different locations over transformed versus original images. Third, we utilize the new data over transformed images, called data augmentation transformation (DAT), to train deep saliency models. We find that label-preserving DATs with negligible impact on human gaze boost saliency prediction, whereas some other DATs that severely impact human gaze degrade performance. These label-preserving, valid augmentation transformations provide a solution for enlarging existing saliency datasets. Finally, we introduce a novel saliency model based on generative adversarial networks (dubbed GazeGAN). A modified U-Net is utilized as the generator of GazeGAN, which combines the classic "skip connection" with a novel "center-surround connection" (CSC) module. Our proposed CSC module mitigates trivial artifacts while emphasizing semantic salient regions, and increases model nonlinearity, thus demonstrating better robustness against transformations. Extensive experiments and comparisons indicate that GazeGAN achieves state-of-the-art performance over multiple datasets. We also provide a comprehensive comparison of 22 saliency models on various transformed scenes, which contributes a new robustness benchmark to the saliency community. Our code and dataset are publicly available.
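
Label-preserving augmentation for saliency data can be sketched as applying geometric transforms jointly to the image and its fixation map while applying mild photometric transforms to the image only; the specific transforms and magnitudes below are assumptions, not the paper's validated set of DATs.

```python
# Sketch of label-preserving augmentation for saliency data (assumed transforms):
# geometric changes are applied to both image and fixation map, photometric ones
# to the image only, so the gaze "label" stays consistent with the input.
import numpy as np

def augment_pair(image, fix_map, rng):
    """image: (H, W, 3) float in [0,1]; fix_map: (H, W) fixation density."""
    if rng.random() < 0.5:                       # horizontal flip: apply to both
        image, fix_map = image[:, ::-1].copy(), fix_map[:, ::-1].copy()
    if rng.random() < 0.5:                       # mild brightness shift: image only
        image = np.clip(image + rng.uniform(-0.1, 0.1), 0.0, 1.0)
    if rng.random() < 0.5:                       # mild Gaussian noise: image only
        image = np.clip(image + rng.normal(0, 0.02, image.shape), 0.0, 1.0)
    return image, fix_map

rng = np.random.default_rng(0)
img, fmap = augment_pair(np.random.rand(240, 320, 3), np.random.rand(240, 320), rng)
print(img.shape, fmap.shape)
```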

16.
IEEE Trans Pattern Anal Mach Intell ; 41(4): 815-828, 2019 04.
Article in English | MEDLINE | ID: mdl-29993862

ABSTRACT

Recent progress on salient object detection has been substantial, benefiting mostly from the explosive development of convolutional neural networks (CNNs). Semantic segmentation and salient object detection algorithms developed lately have mostly been based on fully convolutional neural networks (FCNs). There is still large room for improvement over the generic FCN models that do not explicitly deal with the scale-space problem. The Holistically-Nested Edge Detector (HED) provides a skip-layer structure with deep supervision for edge and boundary detection, but the performance gain of HED on saliency detection is not obvious. In this paper, we propose a new salient object detection method by introducing short connections to the skip-layer structures within the HED architecture. Our framework takes full advantage of multi-level and multi-scale features extracted from FCNs, providing more advanced representations at each layer, a property that is critically needed to perform segment detection. Our method produces state-of-the-art results on five widely tested salient object detection benchmarks, with advantages in terms of efficiency (0.08 seconds per image), effectiveness, and simplicity over existing algorithms. Beyond that, we conduct an exhaustive analysis of the role of training data on performance, and we provide a training set for future research and fair comparisons.

17.
Article in English | MEDLINE | ID: mdl-29994308

ABSTRACT

We propose a novel unsupervised, game-theoretic salient object detection algorithm that does not require labeled training data. First, the saliency detection problem is formulated as a non-cooperative game, hereinafter referred to as the Saliency Game, in which image regions are players that choose to be "background" or "foreground" as their pure strategies. A payoff function is constructed by exploiting multiple cues and combining complementary features. Saliency maps are generated according to each region's strategy in the Nash equilibrium of the proposed Saliency Game. Second, we explore the complementary relationship between color and deep features and propose an Iterative Random Walk algorithm to combine the saliency maps produced by the Saliency Game using different features. The Iterative Random Walk allows sharing information across feature spaces and detecting objects that are otherwise very hard to detect. Extensive experiments over six challenging datasets demonstrate the superiority of our unsupervised algorithm compared to several state-of-the-art supervised algorithms.
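
The information-sharing idea behind the Iterative Random Walk can be sketched as alternating personalized random-walk updates on a simple grid graph, seeded in turn by the color-feature and deep-feature maps. The graph construction and restart scheme below are illustrative only, not the paper's formulation.

```python
# Toy sketch of an iterative random-walk style fusion of two saliency maps;
# the 4-neighbour grid graph and alternating seeds are assumptions for illustration.
import numpy as np

def grid_transition(h, w):
    n = h * w
    P = np.zeros((n, n))
    for y in range(h):
        for x in range(w):
            i = y * w + x
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    P[i, yy * w + xx] = 1.0
    return P / P.sum(axis=1, keepdims=True)          # row-stochastic transition matrix

def iterative_random_walk(map_a, map_b, alpha=0.85, iters=20):
    h, w = map_a.shape
    P = grid_transition(h, w)
    a, b = map_a.ravel() / map_a.sum(), map_b.ravel() / map_b.sum()
    for t in range(iters):
        seed = b if t % 2 else a           # alternate seeds: share info across feature spaces
        a = alpha * P.T @ a + (1 - alpha) * seed
        b = alpha * P.T @ b + (1 - alpha) * seed
    fused = (a + b) / 2
    return (fused / fused.max()).reshape(h, w)

rng = np.random.default_rng(0)
print(iterative_random_walk(rng.random((20, 30)), rng.random((20, 30))).shape)
```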

18.
IEEE Trans Neural Netw Learn Syst ; 27(6): 1214-26, 2016 06.
Article in English | MEDLINE | ID: mdl-26452292

ABSTRACT

Predicting where people look in natural scenes has attracted a lot of interest in computer vision and computational neuroscience over the past two decades. Two seemingly contrasting categories of cues have been proposed to influence where people look: 1) low-level image saliency and 2) high-level semantic information. Our first contribution is to take a detailed look at these cues to confirm the hypothesis, proposed by Henderson and by Nuthmann and Henderson, that observers tend to look at the center of objects. We analyzed fixation data for scene free-viewing over 17 observers on 60 object-annotated images with various types of objects. Images contained different types of scenes, such as natural scenes, line drawings, and 3-D rendered scenes. Our second contribution is to propose a simple combined model of low-level saliency and object center bias that significantly outperforms each individual component over our data, as well as on the Object and Semantic Images and Eye-tracking dataset by Xu et al. The results reconcile the saliency and object center-bias hypotheses and highlight that both types of cues are important in guiding fixations. Our work opens new directions for understanding the strategies that humans use in observing scenes and objects, and demonstrates the construction of combined models of low-level saliency and high-level object-based information.
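
The combined model described above admits a very small sketch: a weighted sum of a normalized low-level saliency map and an object center-bias map built from Gaussians at annotated object centers. The Gaussian width and weight below are illustrative assumptions, not fitted values from the study.

```python
# Simple sketch of combining low-level saliency with an object center-bias map
# (Gaussians at annotated object centers); parameters are illustrative only.
import numpy as np

def center_bias_map(shape, object_centers, sigma=30.0):
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    out = np.zeros(shape, dtype=float)
    for cx, cy in object_centers:
        out += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return out / (out.max() + 1e-8)

def combine(saliency, object_centers, w_obj=0.5):
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    return (1 - w_obj) * s + w_obj * center_bias_map(saliency.shape, object_centers)

rng = np.random.default_rng(0)
combined = combine(rng.random((240, 320)), object_centers=[(80, 60), (250, 180)])
print(combined.shape, combined.max())
```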

19.
IEEE Trans Neural Netw Learn Syst ; 27(6): 1266-78, 2016 06.
Article in English | MEDLINE | ID: mdl-26277009

ABSTRACT

Advances in image quality assessment have shown the potential added value of including visual attention aspects in its objective assessment. Numerous models of visual saliency have been implemented and integrated into different image quality metrics (IQMs), but the gain in reliability of the resulting IQMs varies to a large extent. Understanding the causes and trends of this variation would be highly beneficial for further improvement of IQMs, but they are not yet fully understood. In this paper, an exhaustive statistical evaluation is conducted to justify the added value of computational saliency in objective image quality assessment, using 20 state-of-the-art saliency models and 12 best-known IQMs. Quantitative results show that the differences between saliency models in predicting human fixations are sufficient to yield a significant difference in performance gain when adding these saliency models to IQMs. However, surprisingly, the extent to which an IQM can profit from adding a saliency model does not appear to depend directly on how well that saliency model predicts human fixations. Our statistical analysis provides useful guidance for applying saliency models in IQMs, in terms of the effects of saliency-model dependence, IQM dependence, and image-distortion dependence. The testbed and software are made publicly available to the research community.
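
The general recipe for adding saliency to an IQM is to weight a local quality/distortion map by a normalized saliency map before pooling; the sketch below shows that recipe with toy data, not any specific IQM or saliency model from the study.

```python
# Sketch of saliency-weighted pooling for an image quality metric (IQM):
# weight a per-pixel quality map by a normalized saliency map before averaging.
import numpy as np

def saliency_weighted_score(local_quality, saliency):
    """local_quality: per-pixel quality (e.g., a local SSIM map); saliency: same shape."""
    w = saliency / (saliency.sum() + 1e-12)
    return float((local_quality * w).sum())          # saliency-weighted pooling

def plain_score(local_quality):
    return float(local_quality.mean())               # uniform-pooling baseline

rng = np.random.default_rng(0)
q = rng.random((240, 320))
s = rng.random((240, 320))
print(plain_score(q), saliency_weighted_score(q, s))
```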

20.
IEEE Trans Image Process ; 25(4): 1566-79, 2016 Apr.
Article in English | MEDLINE | ID: mdl-26829792

ABSTRACT

A large number of saliency models, each based on a different hypothesis, have been proposed over the past 20 years. In practice, while subscribing to one hypothesis or computational principle makes a model perform well on some types of images, it hinders the model's general performance on arbitrary images and large-scale datasets. One natural approach to improving overall saliency detection accuracy is therefore to fuse different types of models. In this paper, inspired by the success of late-fusion strategies in semantic analysis and multi-modal biometrics, we propose to fuse state-of-the-art saliency models at the score level in a para-boosting learning fashion. First, saliency maps generated by several models are used as confidence scores. Then, these scores are fed into our para-boosting learner (i.e., a support vector machine, adaptive boosting, or a probability density estimator) to generate the final saliency map. In order to explore the strength of para-boosting learners, traditional transformation-based fusion strategies, such as Sum, Min, and Max, are also explored and compared in this paper. To further reduce the computational cost of fusing many models, only a few of them are considered in the next step. Experimental results show that score-level fusion outperforms each individual model and can further reduce the performance gap between current models and the human inter-observer model.
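
Score-level fusion can be sketched directly: treat per-pixel saliency values from several models as confidence scores and fuse them either with simple rules (Sum, Min, Max) or with a learner trained on fixation labels. AdaBoost below is a stand-in for the paper's para-boosting learners, and the data are toy placeholders.

```python
# Sketch of score-level fusion: saliency maps from several models act as per-pixel
# confidence scores, fused by simple rules or by a learner trained on fixations.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def rule_fusion(maps, rule="sum"):
    stack = np.stack(maps)                            # (n_models, H, W)
    return {"sum": stack.sum(0), "min": stack.min(0), "max": stack.max(0)}[rule]

def learned_fusion(maps, fixation_mask):
    X = np.stack([m.ravel() for m in maps], axis=1)   # per-pixel score vector
    y = fixation_mask.ravel().astype(int)
    clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
    return clf.predict_proba(X)[:, 1].reshape(maps[0].shape)

rng = np.random.default_rng(0)
maps = [rng.random((60, 80)) for _ in range(3)]
fix = rng.random((60, 80)) > 0.98                     # sparse toy "fixations"
print(rule_fusion(maps, "max").shape, learned_fusion(maps, fix).shape)
```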
