ABSTRACT
Spatio-temporal patterns of evoked brain activity contain information that can be used to decode and categorize the semantic content of visual stimuli. However, this procedure can be biased by low-level image features independently of the semantic content present in the stimuli, prompting the need to understand the robustness of different models regarding these confounding factors. In this study, we trained machine learning models to distinguish between concepts included in the publicly available THINGS-EEG dataset using electroencephalography (EEG) data acquired during a rapid serial visual presentation paradigm. We investigated the contribution of low-level image features to decoding accuracy in a multivariate model, utilizing broadband data from all EEG channels. Additionally, we explored a univariate model obtained through data-driven feature selection applied to the spatial and frequency domains. While the univariate models exhibited better decoding accuracy, their predictions were less robust to the confounding effect of low-level image statistics. Notably, some of the models maintained their accuracy even after random replacement of the training dataset with semantically unrelated samples that presented similar low-level content. In conclusion, our findings suggest that model optimization impacts sensitivity to confounding factors, regardless of the resulting classification performance. Therefore, the choice of EEG features for semantic decoding should ideally be informed by criteria beyond classifier performance, such as the neurobiological mechanisms under study.
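The multivariate decoding approach described above can be sketched on synthetic data. This is a minimal illustration of the general idea (trials as channel-by-time feature vectors fed to a simple nearest-centroid decoder), not the authors' actual THINGS-EEG pipeline; all dimensions and effect sizes below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for epoched EEG: trials x (channels * timepoints).
# Two "concepts" whose class means differ slightly; values are illustrative.
n_trials, n_feat = 200, 64
X = rng.normal(size=(n_trials, n_feat))
y = np.repeat([0, 1], n_trials // 2)
X[y == 1] += 0.5  # small class-mean offset, a toy stand-in for a concept effect

# Split into train/test halves (interleaved indices keep classes balanced).
train, test = np.arange(0, n_trials, 2), np.arange(1, n_trials, 2)

# Nearest-centroid classifier: one simple multivariate decoder.
centroids = np.stack([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(X[test][:, None, :] - centroids[None], axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == y[test]).mean()
print(f"decoding accuracy: {accuracy:.2f}")
```

With real EEG the feature matrix would come from preprocessed epochs, and cross-validation would replace this single split.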
Subjects
Electroencephalography, Semantics, Humans, Electroencephalography/methods, Female, Male, Adult, Young Adult, Machine Learning, Brain/physiology

ABSTRACT
Competing theories attempt to explain what guides eye movements when exploring natural scenes: bottom-up image salience and top-down semantic salience. In one study, we apply language-based analyses to quantify the well-known observation that task influences gaze in natural scenes. Subjects viewed ten scenes as if they were performing one of two tasks. We found that the semantic similarity between the task and the labels of objects in the scenes captured the task-dependence of gaze (t(39) = 13.083, p < .001). In another study, we examined whether image salience or semantic salience better predicts gaze during a search task, and if viewing strategies are affected by searching for targets of high or low semantic relevance to the scene. Subjects searched 100 scenes for a high- or low-relevance object. We found that image salience becomes a worse predictor of gaze across successive fixations, while semantic salience remains a consistent predictor (χ2(1, N = 40) = 75.148, p < .001). Furthermore, we found that semantic salience decreased as object relevance decreased (t(39) = 2.304, p = .027). These results suggest that semantic salience is a useful predictor of gaze during task-related scene viewing, and that even in target-absent trials, gaze is modulated by the relevance of a search target to the scene in which it might be located.
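The core language-based analysis (scoring scene objects by their semantic similarity to the task) can be sketched with cosine similarity over embedding vectors. The vectors and labels below are illustrative toy values, not outputs of the word-embedding model used in the study.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embedding vectors; in practice these would come from a trained
# word-embedding model (the 3-d vectors below are invented).
emb = {
    "cooking": np.array([0.9, 0.1, 0.0]),
    "pan":     np.array([0.8, 0.2, 0.1]),
    "sofa":    np.array([0.1, 0.9, 0.2]),
}

task = "cooking"
scores = {obj: cosine(emb[task], emb[obj]) for obj in ("pan", "sofa")}
# Objects more semantically related to the task receive higher scores,
# predicting where task-driven gaze should concentrate.
print(scores)
```

A full analysis would average such scores over all labeled objects per fixation and relate them to measured gaze.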
Subjects
Attention, Ocular Fixation, Semantics, Humans, Ocular Fixation/physiology, Attention/physiology, Male, Female, Young Adult, Adult, Visual Pattern Recognition/physiology, Eye Movements/physiology

ABSTRACT
Previous work has demonstrated similarities and differences between aerial and terrestrial image viewing. Aerial scene categorization, a pivotal visual processing task for gathering geoinformation, heavily depends on rotation-invariant information. Aerial image-centered research has revealed effects of low-level features on performance of various aerial image interpretation tasks. However, there are fewer studies of viewing behavior for aerial scene categorization and of higher-level factors that might influence that categorization. In this paper, experienced subjects' eye movements were recorded while they were asked to categorize aerial scenes. A typical viewing center bias was observed. Eye movement patterns varied among categories. We explored the relationship of nine image statistics to observers' eye movements. Results showed that if the images were less homogeneous, and/or if they contained fewer or no salient diagnostic objects, viewing behavior became more exploratory. Higher- and object-level image statistics were predictive at both the image and scene category levels. Scanpaths were generally organized and small differences in scanpath randomness could be roughly captured by critical object saliency. Participants tended to fixate on critical objects. Image statistics included in this study showed rotational invariance. The results supported our hypothesis that the availability of diagnostic objects strongly influences eye movements in this task. In addition, this study provides supporting evidence for Loschky et al.'s (Journal of Vision, 15(6), 11, 2015) speculation that aerial scenes are categorized on the basis of image parts and individual objects. The findings were discussed in relation to theories of scene perception and their implications for automation development.
Subjects
Eye Movements, Visual Perception, Humans, Photic Stimulation/methods, Automation, Records

ABSTRACT
Humans have well-documented priors for many features present in nature that guide visual perception. Despite being putatively grounded in the statistical regularities of the environment, scene priors are frequently violated due to the inherent variability of visual features from one scene to the next. However, these repeated violations do not appreciably challenge visuo-cognitive function, necessitating the broad use of priors in conjunction with context-specific information. We investigated the trade-off between participants' internal expectations formed from both longer-term priors and those formed from immediate contextual information using a perceptual inference task and naturalistic stimuli. Notably, our task required participants to make perceptual inferences about naturalistic images using their own internal criteria, rather than making comparative judgements. Nonetheless, we show that observers' performance is well approximated by a model that makes inferences using a prior for low-level image statistics, aggregated over many images. We further show that the dependence on this prior is rapidly re-weighted against contextual information, even when misleading. Our results therefore provide insight into how apparent high-level interpretations of scene appearances follow from the most basic of perceptual processes, which are grounded in the statistics of natural images.
Subjects
Judgment, Visual Perception, Humans, Cognition

ABSTRACT
What makes objects alike in the human mind? Computational approaches for characterizing object similarity have largely focused on the visual forms of objects or their linguistic associations. However, intuitive notions of object similarity may depend heavily on contextual reasoning; that is, objects may be grouped together in the mind if they occur in the context of similar scenes or events. Using large-scale analyses of natural scene statistics and human behavior, we found that a computational model of the associations between objects and their scene contexts is strongly predictive of how humans spontaneously group objects by similarity. Specifically, we learned contextual prototypes for a diverse set of object categories by taking the average response of a convolutional neural network (CNN) to the scene contexts in which the objects typically occurred. In behavioral experiments, we found that contextual prototypes were strongly predictive of human similarity judgments for a large set of objects and rivaled the performance of models based on CNN representations of the objects themselves or word embeddings for their names. Together, our findings reveal the remarkable degree to which the natural statistics of context predict commonsense notions of object similarity.
Subjects
Judgment, Neural Networks (Computer), Humans, Judgment/physiology, Photic Stimulation, Learning, Problem Solving, Visual Pattern Recognition/physiology

ABSTRACT
The perception of depth from retinal images depends on information from multiple visual cues. One potential depth cue is the statistical relationship between luminance and distance; darker points in a local region of an image tend to be farther away than brighter points. We establish that this statistical relationship acts as a quantitative cue to depth. We show that luminance variations affect depth in naturalistic scenes containing multiple cues to depth. This occurred when the correlation between variations of luminance and depth was manipulated within an object, but not between objects. This is consistent with the local nature of the statistical relationship in natural scenes. We also showed that perceived depth increases as contrast is increased, but only when the depth signalled by luminance and binocular disparity are consistent. Our results show that the negative correlation between luminance and distance, as found under diffuse lighting, provides a depth cue that is combined with depth from binocular disparity, in a way that is consistent with the simultaneous estimation of surface depth and reflectance variations. Adopting more complex lighting models such as ambient occlusion in computer rendering will thus contribute to the accuracy as well as the aesthetic appearance of three-dimensional graphics.
ABSTRACT
With the continued growth of digital device use, a greater portion of the visual world experienced daily by many people has shifted towards digital environments. The "oblique effect" denotes a bias for horizontal and vertical (canonical) contours over oblique contours, which is derived from a disproportionate exposure to canonical content. Carpentered environments have been shown to possess proportionally more canonical than oblique contours, leading to perceptual bias in those who live in "built" environments. Likewise, there is potential for orientation sensitivity to be shaped by frequent exposure to digital content. The potential influence of digital content on the oblique effect was investigated by measuring the degree of orientation anisotropy from a range of digital scenes using Fourier analysis. Content from popular cartoons, video games, and social communication websites was compared to real-life nature, suburban, and urban scenes. Findings suggest that digital content varies widely in orientation anisotropy, but pixelated video games and social communication websites were found to exhibit a degree of orientation anisotropy substantially exceeding that observed in all measured categories of real-world environments. Therefore, the potential may exist for digital content to induce an even greater shift in orientation bias than has been observed in previous research. This potential, and implications of such a shift, is discussed.
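The Fourier-based anisotropy measurement can be sketched as follows: compute the 2D power spectrum of an image and compare the energy near canonical (0°/90°) versus oblique (45°/135°) orientations. The ±15° tolerance band and the summed-power measure are illustrative assumptions, not the paper's exact metric.

```python
import numpy as np

def orientation_energy(img, tol=15.0):
    """Sum Fourier power near canonical (0/90 deg) vs oblique (45/135 deg)
    orientations. The +/- tol degree band is an illustrative choice."""
    f = np.fft.fftshift(np.fft.fft2(img - img.mean()))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[-(h // 2):h - h // 2, -(w // 2):w - w // 2]
    theta = np.degrees(np.arctan2(yy, xx)) % 180.0
    r = np.hypot(xx, yy)

    def near(a):  # angular distance to a, with wraparound at 180 deg
        return np.minimum(np.abs(theta - a), 180.0 - np.abs(theta - a)) < tol

    canonical = (near(0) | near(90)) & (r > 0)
    oblique = (near(45) | near(135)) & (r > 0)
    return power[canonical].sum(), power[oblique].sum()

# A horizontal grating concentrates power on the vertical frequency axis,
# i.e., in the canonical band.
rows = np.arange(64)[:, None] * np.ones((1, 64))
grating = np.sin(2 * np.pi * rows / 8.0)
can, obl = orientation_energy(grating)
print(can > obl)  # True
```

Applied to screenshots of digital content versus photographs of natural scenes, the canonical-to-oblique energy ratio gives one scalar index of orientation anisotropy per image.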
Subjects
Orientation, Video Games, Humans, Visual Perception, Bias, Anisotropy

ABSTRACT
Detecting object boundaries is crucial for recognition, but how the process unfolds in visual cortex remains unknown. To study the problem faced by a hypothetical boundary cell, and to predict how cortical circuitry could produce a boundary cell from a population of conventional "simple cells," we labeled 30,000 natural image patches and used Bayes' rule to help determine how a simple cell should influence a nearby boundary cell depending on its relative offset in receptive field position and orientation. We identified the following three basic types of cell-cell interactions: rising and falling interactions with a range of slopes and saturation rates, and nonmonotonic (bump-shaped) interactions with varying modes and amplitudes. Using simple models, we show that a ubiquitous cortical circuit motif consisting of direct excitation and indirect inhibition (a compound effect we call "incitation") can produce the entire spectrum of simple cell-boundary cell interactions found in our dataset. Moreover, we show that the synaptic weights that parameterize an incitation circuit can be learned by a single-layer "delta" rule. We conclude that incitatory interconnections are a generally useful computing mechanism that the cortex may exploit to help solve difficult natural classification problems.

SIGNIFICANCE STATEMENT: Simple cells in primary visual cortex (V1) respond to oriented edges and have long been supposed to detect object boundaries, yet the prevailing model of a simple cell (a divisively normalized linear filter) is a surprisingly poor natural boundary detector. To understand why, we analyzed image statistics on and off object boundaries, allowing us to characterize the neural-style computations needed to perform well at this difficult natural classification task. We show that a simple circuit motif known to exist in V1 is capable of extracting high-quality boundary probability signals from local populations of simple cells. Our findings suggest a new, more general way of conceptualizing cell-cell interconnections in the cortex.
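The use of Bayes' rule over labeled patches can be sketched on synthetic data: given a binary "simple cell fires" feature, estimate the posterior probability of a boundary from empirical counts. All probabilities and firing rates below are invented for illustration; the study used graded responses and relative receptive-field offsets.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic labels for image patches: True = patch contains a boundary.
n = 30_000
boundary = rng.random(n) < 0.2
# A hypothetical nearby "simple cell" fires more often on boundary patches.
p_fire = np.where(boundary, 0.7, 0.3)
fires = rng.random(n) < p_fire

# Bayes' rule from empirical counts:
#   P(boundary | fires) = P(fires | boundary) * P(boundary) / P(fires)
p_b = boundary.mean()
p_f_given_b = fires[boundary].mean()
p_f = fires.mean()
posterior = p_f_given_b * p_b / p_f

print(f"P(boundary | cell fires) ~= {posterior:.2f}")
```

Repeating this for simple cells at many position and orientation offsets yields the interaction profiles (rising, falling, bump-shaped) that the circuit model is asked to reproduce.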
Subjects
Visual Cortex, Bayes Theorem, Recognition (Psychology), Learning, Cell Communication

ABSTRACT
With the development of digital imaging techniques, image quality assessment methods are receiving more attention in the literature. Since distortion-free versions of camera images in many practical, everyday applications are not available, the need for effective no-reference image quality assessment (NR-IQA) algorithms is growing. Therefore, this paper introduces a novel no-reference image quality assessment algorithm for the objective evaluation of authentically distorted images. Specifically, we apply a broad spectrum of local and global feature vectors to characterize the variety of authentic distortions. Among the employed local features, the statistics of popular local feature descriptors, such as SURF, FAST, BRISK, or KAZE, are proposed for NR-IQA; other features are also introduced to boost the performance of local features. The proposed method was compared to 12 other state-of-the-art algorithms on popular and accepted benchmark datasets containing RGB images with authentic distortions (CLIVE, KonIQ-10k, and SPAQ). The introduced algorithm significantly outperforms the state-of-the-art in terms of correlation with human perceptual quality ratings.
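Correlation with human quality ratings in this literature is conventionally reported as the Spearman rank-order correlation coefficient (SROCC). A minimal numpy implementation (valid when there are no tied scores) with hypothetical predicted scores and mean opinion scores:

```python
import numpy as np

def srocc(a, b):
    """Spearman rank-order correlation: Pearson correlation of ranks.
    The double-argsort ranking assumes no tied values."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

# Hypothetical predicted quality scores vs. human mean opinion scores (MOS).
predicted = np.array([0.2, 0.9, 0.4, 0.7, 0.1])
mos       = np.array([30.0, 80.0, 45.0, 75.0, 20.0])
print(f"SROCC = {srocc(predicted, mos):.3f}")
```

Because SROCC depends only on rank order, it tolerates any monotonic mapping between an algorithm's score scale and the human rating scale, which is why it is the standard benchmark metric.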
ABSTRACT
Visual texture is an important cue to figure-ground organization. While processing of texture differences is a prerequisite for the use of this cue to extract figure-ground organization, these stages are distinct processes. One potential indicator of this distinction is the possibility that texture statistics play a different role in the figure vs. in the ground. To determine whether this is the case, we probed figure-ground processing with a family of local image statistics that specified textures that varied in the strength and spatial scale of structure, and the extent to which features are oriented. For image statistics that generated approximately isotropic textures, the threshold for identification of figure-ground structure was determined by the difference in correlation strength in figure vs. ground, independent of whether the correlations were present in figure, ground, or both. However, for image statistics with strong orientation content, thresholds were up to two times higher for correlations in the ground, vs. the figure. This held equally for texture-defined objects with convex or concave boundaries, indicating that these threshold differences are driven by border ownership, not boundary shape. Similar threshold differences were found for presentation times ranging from 125 to 500 ms. These findings identify a qualitative difference in how texture is used for figure-ground analysis, vs. texture discrimination. Additionally, they reveal a functional recursion: texture differences are needed to identify tentative boundaries and consequent scene organization into figure and ground, but then scene organization modifies sensitivity to texture differences according to the figure-ground assignment.
Subjects
Cues (Psychology), Visual Pattern Recognition, Humans

ABSTRACT
In conventional psychophysical reverse correlation methods using white or pink noise, luminance noise is added to every pixel. Thus, the image features correlated with perception are often biased toward local mean luminance. Furthermore, spatial frequencies and orientations are represented in the primary visual cortex, where they form the basis of various aspects of visual perception. In this study, we proposed a new reverse correlation method using noise that modulates spatial frequency sub-band contrast, and examined its properties in psychophysical experiments on facial skin lightness perception. Observers compared perceived skin lightness in a paired-comparison manner on face stimuli whose spatial frequency sub-band contrasts were increased or decreased at random spatial locations. The results showed that contrasts in the eyes or irises were strongly and positively correlated with perceived skin lightness in most sub-bands, demonstrating that the proposed method reproduced the previous finding that the sparkle of the irises makes the skin appear lighter. In contrast, when the conventional reverse correlation method using pink noise images was applied to skin lightness perception, only the local mean luminance in some skin regions, such as the forehead, was correlated with the percept. In summary, the proposed method revealed image features in facial parts other than skin mean luminance that are relevant to skin lightness perception and difficult to detect with the conventional method. The two methods can be considered complementary, given that they extracted considerably different image features; which is more appropriate depends on the psychophysical task and stimuli.
Subjects
Contrast Sensitivity, Light, Data Correlation, Humans, Noise, Visual Perception

ABSTRACT
We developed an image-computable observer model of the initial visual encoding that operates on natural image input, based on the framework of Bayesian image reconstruction from the excitations of the retinal cone mosaic. Our model extends previous work on ideal observer analysis and evaluation of performance beyond psychophysical discrimination, takes into account the statistical regularities of the visual environment, and provides a unifying framework for answering a wide range of questions regarding the visual front end. Using the error in the reconstructions as a metric, we analyzed variations of the number of different photoreceptor types on the human retina as an optimal design problem. In addition, the reconstructions allow both visualization and quantification of information loss due to physiological optics and cone mosaic sampling, and how these vary with eccentricity. Furthermore, in simulations of color deficiencies and interferometric experiments, we found that the reconstructed images provide a reasonable proxy for modeling subjects' percepts. Lastly, we used the reconstruction-based observer for the analysis of psychophysical thresholds, and found notable interactions between spatial frequency and chromatic direction in the resulting spatial contrast sensitivity function. Our method is widely applicable to experiments and applications in which the initial visual encoding plays an important role.
Subjects
Computer Simulation, Computer-Assisted Image Processing/methods, Retinal Cone Photoreceptor Cells/physiology, Ocular Vision/physiology, Visual Perception/physiology, Bayes Theorem, Color Perception/physiology, Contrast Sensitivity, Humans, Photic Stimulation, Software

ABSTRACT
Many studies use different categories of images to define their conditions. Since any difference between these categories is a valid candidate to explain category-related behavioral differences, knowledge about the objective image differences between categories is crucial for the interpretation of the behaviors. However, natural images vary in many image features and not every feature is equally important in describing the differences between the categories. Here, we provide a methodological approach that uses machine learning performance as a tool to find as many as possible of the image features that have predictive value for the category the images belong to. In other words, we describe a means to find the features of a group of images by which the categories can be objectively and quantitatively defined. Note that we are not aiming to provide a means for the best possible decoding performance; instead, our aim is to uncover prototypical characteristics of the categories. To facilitate the use of this method, we offer an open-source, MATLAB-based toolbox that performs such an analysis and aids the user in visualizing the features of relevance. We first applied the toolbox to a mock data set with a ground truth to show the sensitivity of the approach. Next, we applied the toolbox to a set of natural images as a more practical example.
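The core idea (scoring each image feature by how well it alone predicts the category, with a mock data set providing a ground truth) can be sketched in a few lines. This is a Python stand-in for the MATLAB toolbox, not its actual code; the midpoint-threshold classifier and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Mock "image features" for two categories; only feature 0 is genuinely
# predictive, giving a built-in ground truth for the screening procedure.
n_per, n_feat = 100, 6
a = rng.normal(size=(n_per, n_feat))
b = rng.normal(size=(n_per, n_feat))
b[:, 0] += 1.5  # the category difference is carried by feature 0 only

def single_feature_accuracy(xa, xb):
    """Accuracy of a midpoint-threshold classifier on one feature."""
    thr = (xa.mean() + xb.mean()) / 2
    acc = ((xa < thr).mean() + (xb >= thr).mean()) / 2
    return max(acc, 1 - acc)  # allow either sign of the difference

scores = np.array([single_feature_accuracy(a[:, j], b[:, j])
                   for j in range(n_feat)])
print("most predictive feature:", scores.argmax())
```

Features whose individual decoding accuracy sits well above chance are the ones that objectively and quantitatively define the categories.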
Subjects
Machine Learning, Humans, Facial Expression, Three-Dimensional Imaging

ABSTRACT
Technological advances in recent decades have allowed us to measure both the information available to the visual system in the natural environment and the rich array of behaviors that the visual system supports. This review highlights the tasks undertaken by the binocular visual system in particular and how, for much of human activity, these tasks differ from those considered when an observer fixates a static target on the midline. The everyday motor and perceptual challenges involved in generating a stable, useful binocular percept of the environment are discussed, together with how these challenges are but minimally addressed by much of current clinical interpretation of binocular function. The implications for new technology, such as virtual reality, are also highlighted in terms of clinical and basic research application.
Subjects
Depth Perception, Binocular Vision, Environment, Humans

ABSTRACT
The visual system represents textural image regions as simple statistics that are useful for the rapid perception of scenes and surfaces. What image 'textures' are, however, has so far mostly been subjectively defined. The present study investigated the empirical conditions under which natural images are processed as texture. We first show that 'texturality' (i.e., whether or not an image is perceived as a texture) is strongly correlated with the perceived similarity between an original image and its Portilla-Simoncelli (PS) synthesized image. We found that both judgments are highly correlated with specific PS statistics of the image. We also demonstrate that a discriminant model based on a small set of image statistics could discriminate whether a given image was perceived as a texture with over 90% accuracy. The results provide a method to determine whether a given image region is represented statistically by the human visual system.
ABSTRACT
Efficient processing of sensory data requires adapting the neuronal encoding strategy to the statistics of natural stimuli. Previously, in Hermundstad et al., 2014, we showed that local multipoint correlation patterns that are most variable in natural images are also the most perceptually salient for human observers, in a way that is compatible with the efficient coding principle. Understanding the neuronal mechanisms underlying such adaptation to image statistics will require performing invasive experiments that are impossible in humans. Therefore, it is important to understand whether a similar phenomenon can be detected in animal species that allow for powerful experimental manipulations, such as rodents. Here we selected four image statistics (from single- to four-point correlations) and trained four groups of rats to discriminate between white noise patterns and binary textures containing variable intensity levels of one of such statistics. We interpreted the resulting psychometric data with an ideal observer model, finding a sharp decrease in sensitivity from two- to four-point correlations and a further decrease from four- to three-point. This ranking fully reproduces the trend we previously observed in humans, thus extending a direct demonstration of efficient coding to a species where neuronal and developmental processes can be interrogated and causally manipulated.
Subjects
Psychological Discrimination/physiology, Visual Pattern Recognition/physiology, Visual Perception/physiology, Animals, Animal Behavior/physiology, Operant Conditioning, Male, Long-Evans Rats

ABSTRACT
The primate visual system analyzes statistical information in natural images and uses it for the immediate perception of scenes, objects, and surface materials. To investigate the dynamical encoding of image statistics in the human brain, we measured visual evoked potentials (VEPs) for 166 natural textures and their synthetic versions, and performed a reverse-correlation analysis of the VEPs and representative texture statistics of the image. The analysis revealed occipital VEP components strongly correlated with particular texture statistics. VEPs correlated with low-level statistics, such as subband SDs, emerged rapidly from 100 to 250 ms in a spatial frequency dependent manner. VEPs correlated with higher-order statistics, such as subband kurtosis and cross-band correlations, were observed at slightly later times. Moreover, these robust correlations enabled us to inversely estimate texture statistics from VEP signals via linear regression and to reconstruct texture images that appear similar to those synthesized with the original statistics. Additionally, we found significant differences in VEPs at 200-300 ms between some natural textures and their Portilla-Simoncelli (PS) synthesized versions, even though they shared almost identical texture statistics. This differential VEP was related to the perceptual "unnaturalness" of PS-synthesized textures. These results suggest that the visual cortex rapidly encodes image statistics hidden in natural textures specifically enough to predict the visual appearance of a texture, while it also represents high-level information beyond image statistics, and that electroencephalography can be used to decode these cortical signals.
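The inverse-estimation step (predicting a texture statistic from VEP signals via linear regression) can be sketched with closed-form ridge regression on synthetic data. The dimensions, the linear encoding, and the noise level below are invented; the actual analysis regressed real VEP waveforms against measured texture statistics.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "VEP" features (trials x channels*timepoints) that linearly
# encode one texture statistic plus noise; dimensions are illustrative.
n_trials, n_feat = 166, 40
stat = rng.normal(size=n_trials)          # e.g., a subband SD per image
encoder = rng.normal(size=n_feat)         # hypothetical linear encoding
veps = np.outer(stat, encoder) + 0.5 * rng.normal(size=(n_trials, n_feat))

# Ridge regression, closed form: w = (X'X + lam*I)^-1 X'y
lam = 1.0
w = np.linalg.solve(veps.T @ veps + lam * np.eye(n_feat), veps.T @ stat)
estimate = veps @ w

r = np.corrcoef(stat, estimate)[0, 1]
print(f"correlation between true and estimated statistic: r = {r:.2f}")
```

In practice the regression would be fit and evaluated with cross-validation over images to avoid overstating the decodable information.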
ABSTRACT
Texture information plays a critical role in the rapid perception of scenes, objects, and materials. Here, we propose a novel model in which visual texture perception is essentially determined by the 1st-order (2D-luminance) and 2nd-order (4D-energy) spectra. This model is an extension of the dimensionality of the Filter-Rectify-Filter (FRF) model, and it also corresponds to the frequency representation of the Portilla-Simoncelli (PS) statistics. We show that preserving two spectra and randomizing phases of a natural texture image result in a perceptually similar texture, strongly supporting the model. Based on only two single spectral spaces, this model provides a simpler framework to describe and predict texture representations in the primate visual system. The idea of multi-order spectral analysis is consistent with the hierarchical processing principle of the visual cortex, which is approximated by a multi-layer convolutional network.
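The phase-randomization manipulation can be sketched for the first-order (luminance) spectrum: keep the Fourier amplitude spectrum and randomize the phases. Borrowing the phases of a real noise image preserves the conjugate symmetry needed for a real-valued output. This illustrates only the amplitude-preserving half of the manipulation, not the full two-spectra (luminance plus energy) synthesis proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

def phase_scramble(img):
    """Keep the Fourier amplitude spectrum, randomize the phases.
    Using the phases of a real noise image keeps conjugate symmetry,
    so the inverse transform is (numerically) real."""
    amp = np.abs(np.fft.fft2(img))
    noise_phase = np.angle(np.fft.fft2(rng.random(img.shape)))
    return np.real(np.fft.ifft2(amp * np.exp(1j * noise_phase)))

img = rng.random((64, 64))  # stand-in for a natural texture image
scrambled = phase_scramble(img)

# The first-order power spectrum is unchanged by construction.
same_spectrum = np.allclose(np.abs(np.fft.fft2(img)),
                            np.abs(np.fft.fft2(scrambled)))
print(same_spectrum)  # True
```

The model's claim is stronger: preserving both the 1st-order and the 2nd-order (energy) spectra while randomizing the remaining phases should yield a perceptually similar texture.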
ABSTRACT
The perceptual quality of digital images is often deteriorated during storage, compression, and transmission. The most reliable way of assessing image quality is to ask people to provide their opinions on a number of test images. However, this is an expensive and time-consuming process which cannot be applied in real-time systems. In this study, a novel no-reference image quality assessment method is proposed. The introduced method uses a set of novel quality-aware features which globally characterizes the statistics of a given test image, such as extended local fractal dimension distribution feature, extended first digit distribution features using different domains, Bilaplacian features, image moments, and a wide variety of perceptual features. Experimental results are demonstrated on five publicly available benchmark image quality assessment databases: CSIQ, MDID, KADID-10k, LIVE In the Wild, and KonIQ-10k.
ABSTRACT
Visual texture, defined by local image statistics, provides important information to the human visual system for perceptual segmentation. Second-order or spectral statistics (equivalent to the Fourier power spectrum) are a well-studied segmentation cue. However, the role of higher-order statistics (HOS) in segmentation remains unclear, particularly for natural images. Recent experiments indicate that, in peripheral vision, the HOS of the widely adopted Portilla-Simoncelli texture model are a weak segmentation cue compared to spectral statistics, despite the fact that both are necessary to explain other perceptual phenomena and to support high-quality texture synthesis. Here we test whether this discrepancy reflects a property of natural image statistics. First, we observe that differences in spectral statistics across segments of natural images are redundant with differences in HOS. Second, using linear and nonlinear classifiers, we show that each set of statistics individually affords high performance in natural scenes and texture segmentation tasks, but combining spectral statistics and HOS produces relatively small improvements. Third, we find that HOS improve segmentation for a subset of images, although these images are difficult to identify. We also find that different subsets of HOS improve segmentation to a different extent, in agreement with previous physiological and perceptual work. These results show that the HOS add modestly to spectral statistics for natural image segmentation. We speculate that tuning to natural image statistics under resource constraints could explain the weak contribution of HOS to perceptual segmentation in human peripheral vision.