ABSTRACT
Human vision is still largely unexplained. Computer vision has made impressive progress on this front, but it is still unclear to what extent artificial neural networks approximate human object vision at the behavioral and neural levels. Here, we investigated whether machine object vision mimics the representational hierarchy of human object vision, using an experimental design that allows testing within-domain representations for animals and scenes, as well as across-domain representations reflecting their real-world contextual regularities, such as animal-scene pairs that often co-occur in the visual environment. We found that DCNNs trained for object recognition acquire representations, in their late processing stage, that closely capture human conceptual judgements about the co-occurrence of animals and their typical scenes. Likewise, the DCNNs' representational hierarchy shows surprising similarities with the representational transformations emerging from domain-specific ventrotemporal areas up to domain-general frontoparietal areas. Despite these remarkable similarities, the underlying information processing differs. The ability of neural networks to learn a human-like high-level conceptual representation of object-scene co-occurrence depends on the amount of object-scene co-occurrence present in the image set, thus highlighting the fundamental role of training history. Further, although mid/high-level DCNN layers represent the category division between animals and scenes as observed in VTC, their information content shows reduced domain-specific representational richness. To conclude, by testing within- and between-domain selectivity while manipulating contextual regularities, we reveal unknown similarities and differences in the information processing strategies employed by human and artificial visual systems.
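To make the comparison concrete, the following is a minimal, hypothetical sketch of representational similarity analysis (RSA) of the kind described above: a dissimilarity matrix computed from a DCNN layer's activations is compared against a human-judgment dissimilarity matrix. All variable names and data below are placeholders, not the authors' stimuli, network, or analysis code.

```python
# Hypothetical RSA sketch: compare a DCNN layer's representational geometry
# with human similarity judgments.
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Placeholder data: 40 stimuli (e.g., animal-scene pairs), 512-d layer activations,
# and a 40x40 human dissimilarity matrix; real inputs would come from the
# experiment and the trained network.
layer_activations = rng.normal(size=(40, 512))
human_dissim = rng.uniform(size=(40, 40))
human_dissim = (human_dissim + human_dissim.T) / 2  # symmetrize

# Model RDM: 1 - Pearson correlation between activation patterns (condensed form).
model_rdm = pdist(layer_activations, metric="correlation")

# Compare the upper triangles of the two RDMs with Spearman's rho.
iu = np.triu_indices(40, k=1)
rho, p = spearmanr(model_rdm, human_dissim[iu])
print(f"RSA correlation (layer vs. human judgments): rho={rho:.3f}, p={p:.3g}")
```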
Subjects
Pattern Recognition, Visual; Visual Cortex; Humans; Brain Mapping; Magnetic Resonance Imaging; Visual Perception; Photic Stimulation
ABSTRACT
Some of the most impressive functional specializations in the human brain are found in the occipitotemporal cortex (OTC), where several areas exhibit selectivity for a small number of visual categories, such as faces and bodies, and spatially cluster based on stimulus animacy. Previous studies suggest this animacy organization reflects the representation of an intuitive taxonomic hierarchy, distinct from the presence of face- and body-selective areas in OTC. Using human functional magnetic resonance imaging, we investigated the independent contribution of these two factors (the face-body division and the taxonomic hierarchy) in accounting for the animacy organization of OTC, and whether they might also be reflected in the architecture of several deep neural networks that have not been explicitly trained to differentiate taxonomic relations. We found that graded visual selectivity, based on animal resemblance to human faces and bodies, masquerades as an apparent animacy continuum, which suggests that taxonomy is not a separate factor underlying the organization of the ventral visual pathway.
SIGNIFICANCE STATEMENT: Portions of the visual cortex are specialized to determine whether types of objects are animate in the sense of being capable of self-movement. Two factors have been proposed as accounting for this animacy organization: representations of faces and bodies, and an intuitive taxonomic continuum of humans and animals. We performed an experiment to assess the independent contribution of each of these factors. We found that graded visual representations, based on animal resemblance to human faces and bodies, masquerade as an apparent animacy continuum, suggesting that taxonomy is not a separate factor underlying the organization of areas in the visual cortex.
Subjects
Brain Mapping; Life; Neural Networks, Computer; Occipital Lobe/physiology; Temporal Lobe/physiology; Adult; Animals; Face; Female; Human Body; Humans; Judgment; Magnetic Resonance Imaging; Male; Physical Appearance, Body; Plants; Random Allocation; Young Adult
ABSTRACT
The ontogenetic development of human vision and the real-time neural processing of visual input exhibit a striking similarity: a sensitivity to spatial frequencies that progresses in a coarse-to-fine manner. During early human development, sensitivity to higher spatial frequencies increases with age. In adulthood, when humans receive new visual input, low spatial frequencies are typically processed first, before subsequent processing of higher spatial frequencies. We investigated to what extent this coarse-to-fine progression might impact visual representations in artificial vision and compared this to adult human representations. We simulated the coarse-to-fine progression of image processing in deep convolutional neural networks (CNNs) by gradually increasing spatial frequency information during training. We compared CNN performance after standard and coarse-to-fine training with a wide range of datasets from behavioral and neuroimaging experiments. In contrast to humans, CNNs trained using the standard protocol are very insensitive to low spatial frequency information, performing very poorly when classifying such object images. By training CNNs with our coarse-to-fine method, we improved their classification accuracy from 0% to 32% on low-pass-filtered images taken from the ImageNet dataset. The coarse-to-fine training also made the CNNs more sensitive to low spatial frequencies in hybrid images with conflicting information in different frequency bands. When comparing differently trained networks on images containing full spatial frequency information, we saw no representational differences. Overall, this integration of computational, neural, and behavioral findings shows the relevance of exposure to, and processing of, inputs with varying spatial frequency content for some aspects of high-level object representations.
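A minimal sketch of how such a coarse-to-fine curriculum could be implemented is given below, assuming a simple linear blur schedule; the sigma values, schedule, and batch format are illustrative assumptions, not the training settings used in the study.

```python
# Hedged sketch of a coarse-to-fine curriculum: early epochs see strongly
# low-pass-filtered images, later epochs progressively finer detail.
import numpy as np
from scipy.ndimage import gaussian_filter

def coarse_to_fine_batch(images, epoch, total_epochs, max_sigma=8.0):
    """Blur a batch of images (N, H, W, C); blur strength decays with epoch."""
    # Linearly relax the Gaussian blur from max_sigma down to 0 (full detail).
    sigma = max_sigma * (1.0 - epoch / max(total_epochs - 1, 1))
    if sigma <= 0:
        return images
    # Blur the spatial dimensions only, leaving batch and channel axes untouched.
    return gaussian_filter(images, sigma=(0, sigma, sigma, 0))

# Example: a random "batch" just to show the interface.
batch = np.random.rand(4, 224, 224, 3).astype(np.float32)
early = coarse_to_fine_batch(batch, epoch=0, total_epochs=30)   # coarse input
late  = coarse_to_fine_batch(batch, epoch=29, total_epochs=30)  # full detail
print(early.shape, late.shape)
```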
Subjects
Deep Learning; Adult; Humans; Image Processing, Computer-Assisted; Neural Networks, Computer; Vision, Ocular; Visual Perception
ABSTRACT
Deep Convolutional Neural Networks (CNNs) are gaining traction as the benchmark model of visual object recognition, with performance now surpassing humans. While CNNs can accurately assign one image to potentially thousands of categories, network performance could be the result of layers that are tuned to represent the visual shape of objects rather than object category, since both are often confounded in natural images. Using two stimulus sets that explicitly dissociate shape from category, we correlate these two types of information with each layer of multiple CNNs. We also compare CNN output with fMRI activation along the human visual ventral stream by correlating artificial with neural representations. We find that CNNs encode category information independently from shape, peaking at the final fully connected layer in all tested CNN architectures. When comparing CNNs with fMRI brain data, early visual cortex (V1) and early CNN layers encode shape information, whereas anterior ventral temporal cortex encodes category information, which correlates best with the final layer of CNNs. The interaction between shape and category found along the human visual ventral pathway is echoed in multiple deep networks. Our results suggest that CNNs represent category information independently from shape, much like the human visual system.
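As an illustration of the layer-wise analysis, the sketch below correlates each layer's representational dissimilarity matrix with binary "same category" and "same shape" model matrices to see where each kind of information peaks across the hierarchy. The layer names, stimulus design, and activations are hypothetical placeholders rather than the networks or stimulus sets used here.

```python
# Illustrative sketch (not the authors' code): correlate each layer's RDM with
# binary category and shape model RDMs.
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist

def model_rdm(labels):
    """1 where two stimuli differ on the label, 0 where they match (condensed form)."""
    labels = np.asarray(labels)
    return pdist(labels[:, None], metric=lambda a, b: float(a[0] != b[0]))

# Hypothetical design: 2 categories x 2 shape clusters x 5 exemplars = 20 stimuli.
category = np.repeat([0, 1], 10)
shape    = np.tile(np.repeat([0, 1], 5), 2)

# Placeholder activations for three layers of increasing depth.
rng = np.random.default_rng(1)
layers = {name: rng.normal(size=(20, dim))
          for name, dim in [("conv1", 64), ("conv5", 256), ("fc8", 1000)]}

cat_rdm, shape_rdm = model_rdm(category), model_rdm(shape)
for name, acts in layers.items():
    layer_rdm = pdist(acts, metric="correlation")
    r_cat, _ = spearmanr(layer_rdm, cat_rdm)
    r_shape, _ = spearmanr(layer_rdm, shape_rdm)
    print(f"{name}: category r={r_cat:.3f}, shape r={r_shape:.3f}")
```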
Subjects
Visual Cortex/physiology; Visual Perception; Adult; Brain Mapping; Female; Humans; Male; Nerve Net/physiology; Neural Networks, Computer; Pattern Recognition, Visual; Photic Stimulation; Visual Pathways/physiology; Young Adult
ABSTRACT
PURPOSE: People enjoy supervision during visual field assessment, although resource demands often make this difficult to provide. We evaluated the outcomes and subjective experience of different methods of receiving feedback during perimetry, with the specific goal of comparing a humanoid robot with a computerized voice in participants with minimal prior perimetric experience. Human feedback and no feedback were also compared. METHODS: Twenty-two younger (aged 21-31 years) and 18 older (aged 52-76 years) adults participated. Visual field tests were conducted using an Octopus 900, controlled with the Open Perimetry Interface. Participants underwent four tests with the following feedback conditions, in random order: (1) human, (2) humanoid robot, (3) computer speaker, and (4) no feedback. Feedback rules for the speaker and robot were identical, the difference being a social interaction with the robot before the test. Quantitative perimetric performance was compared using mean sensitivity (dB), fixation losses, and false positives. Subjective experience was collected via survey. RESULTS: There was no significant effect of feedback type on the quantitative measures. For younger adults, the human and robot were preferred to the computer speaker (P < 0.01). For older adults, the experience rating was similar for the speaker and robot. No feedback was the least preferred option for 77% of younger and 50% of older adults. CONCLUSIONS: During perimetry, a social robot was preferred to a computer speaker providing the same feedback, despite the robot not being visible during the test. Making visual field testing more enjoyable for patients and operators may improve compliance and attitudes toward perimetry, leading to improved clinical outcomes. TRANSLATIONAL RELEVANCE: Our data suggest that humanoid robots can replace some aspects of human interaction during perimetry and are preferable to receiving no human feedback.
ABSTRACT
Lightness, or the perceived reflectance of a surface, is influenced by surrounding context. This is demonstrated by the Simultaneous Contrast Illusion (SCI), where a gray patch is perceived as lighter against a black background and vice versa. Conversely, in assimilation the lightness of the target patch shifts toward that of the bounding areas, as demonstrated by White's effect. Blakeslee and McCourt (1999) introduced an oriented difference-of-Gaussian (ODOG) model that is able to account for both contrast and assimilation in a number of lightness illusions and that has subsequently been improved using localized normalization techniques. We introduce a model inspired by image statistics that is based on a family of exponential filters, with kernels spanning multiple sizes and shapes. We include an optional second stage of normalization based on contrast gain control. Our model was tested on a well-known set of lightness illusions that has previously been used to evaluate ODOG and its variants, and model lightness values were compared with typical human data. We investigated whether predictive success depends on filters of a particular size or shape and whether pooling information across filters can improve performance. The best single filter correctly predicted the direction of the lightness effect for 21 of the 27 illusions. Combining two filters increased the best performance to 23, with asymptotic performance at 24 for an arbitrarily large combination of filter outputs. While normalization improved prediction magnitudes, it only slightly improved overall scores for direction predictions. The prediction performance of 24 out of 27 illusions equals that of the best-performing ODOG variant, with greater parsimony. Our model shows that V1-style orientation selectivity is not necessary to account for lightness illusions and that a low-level model based on image statistics can account for a wide range of both contrast and assimilation effects.
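A rough sketch of the filtering idea described above, under stated assumptions: isotropic exponential-decay kernels at several scales, with an optional divisive normalization stage standing in for contrast gain control. The kernel sizes, scales, and normalization form are illustrative choices, not the published model parameters.

```python
# Hedged sketch: exponential filter bank with optional gain-control-like normalization.
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def exponential_kernel(size, scale):
    """Isotropic kernel falling off as exp(-r/scale), normalized to unit sum."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-np.sqrt(xx**2 + yy**2) / scale)
    return k / k.sum()

def filter_bank_response(image, scales=(2, 4, 8, 16), gain_control=True):
    responses = []
    for s in scales:
        k = exponential_kernel(size=6 * s + 1, scale=s)
        r = convolve(image, k, mode="reflect")
        if gain_control:
            # Divisive normalization by smoothed local contrast energy (assumed form).
            local_energy = np.sqrt(gaussian_filter((image - r) ** 2, sigma=s)) + 1e-6
            r = r / local_energy
        responses.append(r)
    return np.stack(responses)

# Example: a simultaneous-contrast-like test image (gray patches on dark/light halves).
img = np.full((128, 256), 0.5)
img[:, :128], img[:, 128:] = 0.0, 1.0
img[48:80, 48:80] = 0.5    # patch on dark half
img[48:80, 176:208] = 0.5  # patch on light half
print(filter_bank_response(img).shape)
```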
ABSTRACT
To improve robustness in object recognition, many artificial visual systems imitate the way in which the human visual cortex encodes object information as a hierarchical set of features. These systems are usually evaluated in terms of their ability to accurately categorize well-defined, unambiguous objects and scenes. In the real world, however, not all objects and scenes are presented clearly, with well-defined labels and interpretations. Visual illusions demonstrate a disparity between perception and objective reality, allowing psychophysicists to methodically manipulate stimuli and study our interpretation of the environment. One prominent effect, the Müller-Lyer illusion, is demonstrated when the perceived length of a line is contracted (or expanded) by the addition of arrowheads (or arrow-tails) to its ends. HMAX, a benchmark object recognition system, consistently produces a bias when classifying Müller-Lyer images. HMAX is a hierarchical, artificial neural network that imitates the "simple" and "complex" cell layers found in the visual ventral stream. In this study, we performed two experiments to explore the Müller-Lyer illusion in HMAX, asking: (1) How do simple versus complex cell operations within HMAX affect illusory bias and precision? (2) How does varying the position of the figures in the input image affect classification by HMAX? In our first experiment, we assessed classification after traversing each layer of HMAX and found that, in general, kernel operations performed by simple cells increase bias and uncertainty, while max-pooling operations executed by complex cells decrease bias and uncertainty. In our second experiment, we increased variation in the positions of figures in the input images, which reduced bias and uncertainty in HMAX. Our findings suggest that the Müller-Lyer illusion is exacerbated by the vulnerability of simple cell operations to positional fluctuations, but ameliorated by the robustness of complex cell responses to such variance.
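The toy example below illustrates the two operations contrasted in the first experiment: an S-layer ("simple" cells) that convolves the image with oriented Gabor kernels, and a C-layer ("complex" cells) that max-pools over local positions, giving tolerance to small shifts. Filter parameters and pooling sizes are assumptions for illustration only, not the HMAX configuration used in the study.

```python
# Toy HMAX-style S/C operations (assumed parameters throughout).
import numpy as np
from scipy.signal import convolve2d

def gabor(size=11, wavelength=6.0, theta=0.0, sigma=3.0):
    ax = np.arange(size) - size // 2
    x, y = np.meshgrid(ax, ax)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

def s_layer(image, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """'Simple' cells: oriented filtering at several orientations."""
    return np.stack([np.abs(convolve2d(image, gabor(theta=t), mode="same"))
                     for t in thetas])

def c_layer(s_maps, pool=8):
    """'Complex' cells: max over local spatial neighborhoods (position tolerance)."""
    n, h, w = s_maps.shape
    h2, w2 = h // pool, w // pool
    trimmed = s_maps[:, :h2 * pool, :w2 * pool]
    return trimmed.reshape(n, h2, pool, w2, pool).max(axis=(2, 4))

image = np.zeros((64, 64))
image[20:44, 31:33] = 1.0  # a short vertical bar
print(c_layer(s_layer(image)).shape)  # (4, 8, 8)
```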
ABSTRACT
Studying illusions provides insight into the way the brain processes information. The Müller-Lyer Illusion (MLI) is a classical geometrical illusion of size, in which perceived line length is decreased by arrowheads and increased by arrow-tails. Many theories have been put forward to explain the MLI, such as misapplied size-constancy scaling, the statistics of image-source relationships, and the filtering properties of signal processing in primary visual areas. Artificial models of the ventral visual processing stream allow us to isolate factors hypothesised to cause the illusion and to test how these affect classification performance. We trained a feed-forward, hierarchical feature model, HMAX, to perform a dual-category line-length judgment task (short versus long) with over 90% accuracy. We then tested the system's ability to judge relative line lengths for images in a control set versus images that induce the MLI in humans. Results from the computational model show an overall illusory effect similar to that experienced by human subjects. No natural images were used for training, implying that misapplied size constancy and image-source statistics are not necessary factors for generating the illusion. A post-hoc analysis of response weights within a representative trained network ruled out the possibility that the illusion is caused by a reliance on information at low spatial frequencies. Our results suggest that the MLI can be produced using only feed-forward, neurophysiological connections.
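For concreteness, here is a hypothetical sketch of how Müller-Lyer and control stimuli for a short-versus-long length judgment might be rendered as images; the line lengths, fin lengths, and image sizes are arbitrary choices, not the stimulus parameters used in the study.

```python
# Hypothetical stimulus generation: a horizontal shaft with inward ("heads")
# or outward ("tails") fins, of the kind used to probe the Müller-Lyer bias.
import numpy as np

def draw_line(img, x0, y0, x1, y1):
    """Naive line rasterizer onto a 2-D array."""
    n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
    xs = np.linspace(x0, x1, n).round().astype(int)
    ys = np.linspace(y0, y1, n).round().astype(int)
    img[ys, xs] = 1.0
    return img

def muller_lyer(length=60, fins="heads", fin_len=12, size=128):
    img = np.zeros((size, size))
    cy, cx = size // 2, size // 2
    x0, x1 = cx - length // 2, cx + length // 2
    draw_line(img, x0, cy, x1, cy)              # the shaft
    sign = 1 if fins == "heads" else -1         # heads: fins slope inward (looks shorter)
    for x, d in ((x0, 1), (x1, -1)):            # d = +1 toward center at left end, -1 at right
        draw_line(img, x, cy, x + sign * d * fin_len, cy - fin_len)
        draw_line(img, x, cy, x + sign * d * fin_len, cy + fin_len)
    return img

short_heads = muller_lyer(length=50, fins="heads")
long_tails  = muller_lyer(length=70, fins="tails")
print(short_heads.shape, long_tails.shape)
```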