ABSTRACT
Humans rely heavily on the shape of objects to recognise them. Recently, it has been argued that Convolutional Neural Networks (CNNs) can also show a shape-bias, provided their learning environment contains this bias. This has led to the proposal that CNNs provide good mechanistic models of shape-bias and, more generally, human visual processing. However, it is also possible that humans and CNNs show a shape-bias for very different reasons, namely, shape-bias in humans may be a consequence of architectural and cognitive constraints whereas CNNs show a shape-bias as a consequence of learning the statistics of the environment. We investigated this question by exploring shape-bias in humans and CNNs when they learn in a novel environment. We observed that, in this new environment, humans (i) focused on shape and overlooked many non-shape features, even when non-shape features were more diagnostic, (ii) learned based on only one out of multiple predictive features, and (iii) failed to learn when global features, such as shape, were absent. This behaviour contrasted with the predictions of a statistical inference model with no priors, showing the strong role that shape-bias plays in human feature selection. It also contrasted with CNNs that (i) preferred to categorise objects based on non-shape features, and (ii) increased reliance on these non-shape features as they became more predictive. This was the case even when the CNN was pre-trained to have a shape-bias and the convolutional backbone was frozen. These results suggest that shape-bias has a different source in humans and CNNs: while learning in CNNs is driven by the statistical properties of the environment, humans are highly constrained by their previous biases, which suggests that cognitive constraints play a key role in how humans learn to recognise novel objects.
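To make the contrast concrete, the feature-reliance question above can be probed with a cue-conflict test: after training on stimuli where shape and a non-shape cue both predict the category, present stimuli where the two cues disagree and see which one drives the model's choice. A minimal PyTorch sketch under those assumptions; the helper and its inputs are illustrative, not the paper's actual procedure:

```python
import torch

def cue_conflict_bias(model, conflict_images, shape_labels, cue_labels):
    """Fraction of cue-conflict trials classified according to shape
    versus the competing non-shape cue. `conflict_images` are stimuli
    whose shape indicates one category while the non-shape cue
    indicates another (hypothetical inputs for illustration)."""
    model.eval()
    with torch.no_grad():
        preds = model(conflict_images).argmax(dim=1)
    return {
        "shape": (preds == shape_labels).float().mean().item(),
        "non_shape": (preds == cue_labels).float().mean().item(),
    }
```

A shape-biased learner scores high on "shape" here regardless of how predictive the competing cue was during training; a purely statistical learner tracks whichever cue was more diagnostic.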
Subjects
Neural Networks, Computer; Visual Perception; Bias; Blindness; Humans; Learning
ABSTRACT
On several key issues we agree with the commentators. Perhaps most importantly, everyone seems to agree that psychology has an important role to play in building better models of human vision, and (most) everyone agrees (including us) that deep neural networks (DNNs) will play an important role in modelling human vision going forward. But there are also disagreements about what models are for, how DNN-human correspondences should be evaluated, the value of alternative modelling approaches, and the impact of marketing hype in the literature. In our view, these latter issues are contributing to many unjustified claims regarding DNN-human correspondences in vision and other domains of cognition. We explore all these issues in this response.
Subjects
Cognition; Neural Networks, Computer; Humans
ABSTRACT
Nonword pronunciation is a critical challenge for models of reading aloud but little attention has been given to identifying the best method for assessing model predictions. The most typical approach involves comparing the model's pronunciations of nonwords to pronunciations of the same nonwords by human participants and deeming the model's output correct if it matches any transcription of the human pronunciations. The present paper introduces a new ratings-based method, in which participants are shown printed nonwords and asked to rate the plausibility of the provided pronunciations, generated here by a speech synthesiser. We demonstrate this method with reference to a previously published database of 915 disyllabic nonwords (Mousikou et al., 2017). We evaluated two well-known psychological models, RC00 and CDP++, as well as an additional grapheme-to-phoneme algorithm known as Sequitur, and compared our model assessment with the corpus-based method adopted by Mousikou et al. We find that the ratings method: a) is much easier to implement than a corpus-based method, b) has a high hit rate and low false-alarm rate in assessing nonword reading accuracy, and c) provides a similar outcome to the corpus-based method in its assessment of RC00 and CDP++. However, the two methods differed in their evaluation of Sequitur, which performed much better under the ratings method. Indeed, our evaluation of Sequitur revealed that the corpus-based method introduced a number of false positives and, more often, false negatives. Implications of these findings are discussed.
Subjects
Phonetics; Reading; Humans; Attention; Models, Psychological; Algorithms
ABSTRACT
Same-different visual reasoning is a basic skill central to abstract combinatorial thought. This fact has led neural network researchers to test same-different classification on deep convolutional neural networks (DCNNs), which has resulted in a controversy regarding whether this skill is within the capacity of these models. However, most tests of same-different classification rely on testing with images that come from the same pixel-level distribution as the training images, rendering the results inconclusive. In this study, we tested relational same-different reasoning in DCNNs. In a series of simulations we show that models based on the ResNet architecture are capable of visual same-different classification, but only when the test images are similar to the training images at the pixel level. In contrast, when there is a shift in the testing distribution that does not change the relation between the objects in the image, the performance of DCNNs decreases substantially. This finding holds even when the DCNNs' training regime is expanded to include images taken from a wide range of different pixel-level distributions or when the model is trained on the testing distribution but on a different task in a multitask learning context. Furthermore, we show that the relation network, a deep learning architecture specifically designed to tackle visual relational reasoning problems, suffers from the same kinds of limitations. Overall, the results of this study suggest that learning same-different relations is beyond the scope of current DCNNs.
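The pixel-level distribution shift at issue above can be illustrated with a toy stimulus generator: the same/different relation is held fixed while the appearance of the items changes between training and testing. A sketch under assumed stimuli of filled versus outlined square patches (the study's actual images differ):

```python
import numpy as np

def make_trial(same: bool, style: str, size: int = 64, rng=None):
    """Toy same-different trial: two square patches on a blank canvas.
    `style` ('filled' or 'outline') changes the pixel statistics
    without changing the same/different relation -- the kind of shift
    used above to test out-of-distribution generalisation. Patches may
    overlap; this is kept simple for illustration."""
    rng = rng or np.random.default_rng()
    canvas = np.zeros((size, size), dtype=np.float32)

    def patch():
        s = int(rng.integers(8, 16))
        p = np.ones((s, s), dtype=np.float32)
        if style == "outline":
            p[1:-1, 1:-1] = 0.0
        return p

    a = patch()
    b = a.copy() if same else patch()
    while (not same) and b.shape == a.shape:   # a 'different' pair must really differ
        b = patch()
    for p in (a, b):
        x, y = rng.integers(0, size - 16, size=2)
        canvas[y:y + p.shape[0], x:x + p.shape[1]] = p
    return canvas, int(same)
```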
Subjects
Neural Networks, Computer; Humans
ABSTRACT
Deep neural networks (DNNs) have had extraordinary successes in classifying photographic images of objects and are often described as the best models of biological vision. This conclusion is largely based on three sets of findings: (1) DNNs are more accurate than any other model in classifying images taken from various datasets, (2) DNNs do the best job in predicting the pattern of human errors in classifying objects taken from various behavioral datasets, and (3) DNNs do the best job in predicting brain signals in response to images taken from various brain datasets (e.g., single cell responses or fMRI data). However, these behavioral and brain datasets do not test hypotheses regarding what features are contributing to good predictions and we show that the predictions may be mediated by DNNs that share little overlap with biological vision. More problematically, we show that DNNs account for almost no results from psychological research. This contradicts the common claim that DNNs are good, let alone the best, models of human object recognition. We argue that theorists interested in developing biologically plausible models of human vision need to direct their attention to explaining psychological findings. More generally, theorists need to build models that explain the results of experiments that manipulate independent variables designed to test hypotheses rather than compete on making the best predictions. We conclude by briefly summarizing various promising modeling approaches that focus on psychological data.
Subjects
Neural Networks, Computer; Visual Perception; Humans; Visual Perception/physiology; Vision, Ocular; Brain/diagnostic imaging; Brain/physiology; Magnetic Resonance Imaging/methods
ABSTRACT
There is widespread agreement in neuroscience and psychology that the visual system identifies objects and faces based on a pattern of activation over many neurons, each neuron being involved in representing many different categories. The hypothesis that the visual system includes finely tuned neurons for specific objects or faces for the sake of identification, so-called "grandmother cells", is widely rejected. Here it is argued that the rejection of grandmother cells is premature. Grandmother cells constitute a hypothesis of how familiar visual categories are identified, but the primary evidence against this hypothesis comes from studies that have failed to observe neurons that selectively respond to unfamiliar stimuli. These findings are reviewed and it is shown that they are irrelevant. Neuroscientists need to better understand existing models of face and object identification that include grandmother cells and then compare the selectivity of these units with single neurons responding to stimuli that can be identified.
Subjects
Computational Biology; Neurons/physiology; Recognition, Psychology/physiology; Visual Perception/physiology; Animals; Face; Facial Recognition/physiology; Haplorhini/psychology; Humans; Memory, Short-Term/physiology; Models, Neurological; Reward; Visual Cortex/physiology
ABSTRACT
Visual translation tolerance refers to our capacity to recognize objects over a wide range of different retinal locations. Although translation is perhaps the simplest spatial transform that the visual system needs to cope with, the extent to which the human visual system can identify objects at previously unseen locations is unclear, with some studies reporting near-complete invariance across 10 degrees and others reporting zero invariance at 4 degrees of visual angle. Similarly, there is confusion regarding the extent of translation tolerance in computational models of vision, as well as the degree of match between human and model performance. Here, we report a series of eye-tracking studies (total N = 70) demonstrating that novel objects learned at one retinal location can be recognized with high accuracy following translations of up to 18 degrees. We also show that standard deep convolutional neural networks (DCNNs) support our findings when pretrained to classify another set of stimuli across a range of locations, or when a global average pooling (GAP) layer is added to produce larger receptive fields. Our findings provide a strong constraint for theories of human vision and help explain inconsistent findings previously reported with convolutional neural networks (CNNs).
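The global average pooling manipulation mentioned above can be sketched directly: averaging each feature map over space before the classifier removes the head's dependence on where a feature occurs, which enlarges the effective tolerance to translation. A minimal PyTorch sketch; the architecture is illustrative, not the one used in the study:

```python
import torch.nn as nn

class GAPClassifier(nn.Module):
    """Small CNN whose head averages each feature map over space
    (global average pooling) before classification, so the class
    scores no longer depend on *where* a feature appears."""
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)   # (B, 64, H, W) -> (B, 64, 1, 1)
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        h = self.gap(self.features(x)).flatten(1)
        return self.classifier(h)
```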
Subjects
Neural Networks, Computer; Pattern Recognition, Automated/methods; Pattern Recognition, Visual/physiology; Deep Learning; Female; Humans; Male; Young Adult
ABSTRACT
Reading involves a process of matching an orthographic input with stored representations in lexical memory. The masked priming paradigm has become a standard tool for investigating this process. Use of existing results from this paradigm can be limited by the precision of the data and the need for cross-experiment comparisons that lack normal experimental controls. Here, we present a single, large, high-precision, multicondition experiment to address these problems. Over 1,000 participants from 14 sites responded to 840 trials involving 28 different types of orthographically related primes (e.g., castfe-CASTLE) in a lexical decision task, as well as completing measures of spelling and vocabulary. The data were indeed highly sensitive to differences between conditions: After correction for multiple comparisons, prime type condition differences of 2.90 ms and above reached significance at the 5% level. This article presents the method of data collection and preliminary findings from these data, which included replications of the most widely agreed-upon differences between prime types, further evidence for systematic individual differences in susceptibility to priming, and new evidence regarding lexical properties associated with a target word's susceptibility to priming. These analyses will form a basis for the use of these data in quantitative model fitting and evaluation and for future exploration of these data that will inform and motivate new experiments.
Subjects
Databases, Factual; Pattern Recognition, Visual/physiology; Perceptual Masking/physiology; Reading; Repetition Priming/physiology; Analysis of Variance; Humans; Individuality; Language; Memory; Reaction Time; Vocabulary
ABSTRACT
Humans are particularly sensitive to relationships between parts of objects. It remains unclear why this is. One hypothesis is that relational features are highly diagnostic of object categories and emerge as a result of learning to classify objects. We tested this by analyzing the internal representations of supervised convolutional neural networks (CNNs) trained to classify large sets of objects. We found that CNNs do not show the same sensitivity to relational changes as previously observed for human participants. Furthermore, when we precisely controlled the deformations to objects, human behavior was best predicted by the number of relational changes while CNNs were equally sensitive to all changes. Even changing the statistics of the learning environment by making relations uniquely diagnostic did not make networks more sensitive to relations in general. Our results show that learning to classify objects is not sufficient for the emergence of human shape representations. Instead, these results suggest that humans are selectively sensitive to relational changes because they build representations of distal objects from their retinal images and interpret relational changes as changes to these distal objects. This inferential process makes human shape representations qualitatively different from those of artificial neural networks optimized to perform image classification.
Subjects
Learning; Neural Networks, Computer; Humans
ABSTRACT
Natural and artificial audition can in principle acquire different solutions to a given problem. The constraints of the task, however, can nudge the cognitive science and engineering of audition to qualitatively converge, suggesting that a closer mutual examination would potentially enrich artificial hearing systems and process models of the mind and brain. In humans, speech recognition - an area ripe for such exploration - is inherently robust to a number of transformations at various spectrotemporal granularities. To what extent are these robustness profiles accounted for by high-performing neural network systems? We bring together experiments in speech recognition under a single synthesis framework to evaluate state-of-the-art neural networks as stimulus-computable, optimized observers. In a series of experiments, we (1) clarify how influential speech manipulations in the literature relate to each other and to natural speech, (2) show the granularities at which machines exhibit out-of-distribution robustness, reproducing classical perceptual phenomena in humans, (3) identify the specific conditions where model predictions of human performance differ, and (4) demonstrate a crucial failure of all artificial systems to perceptually recover where humans do, suggesting alternative directions for theory and model building. These findings encourage a tighter synergy between the cognitive science and engineering of audition.
Subjects
Speech Perception; Speech; Humans; Neural Networks, Computer; Brain
ABSTRACT
Convolutional neural networks (CNNs) are often described as promising models of human vision, yet they show many differences from human abilities. We focus on a superhuman capacity of top-performing CNNs, namely, their ability to learn very large datasets of random patterns. We verify that human learning on such tasks is extremely limited, even with few stimuli. We argue that the performance difference is due to CNNs' overcapacity and introduce biologically inspired mechanisms to constrain it, while retaining the good test-set generalisation to structured images that is characteristic of CNNs. We investigate the efficacy of adding noise to hidden units' activations, restricting early convolutional layers with a bottleneck, and using a bounded activation function. Internal noise was the most potent intervention and the only one which, by itself, could reduce random-data performance in the tested models to chance levels. We also investigated whether networks with biologically inspired capacity constraints show improved generalisation to out-of-distribution stimuli; however, little benefit was observed. Our results suggest that constraining networks with biologically motivated mechanisms paves the way for closer correspondence between network and human performance, but the few manipulations we have tested are only a small step towards that goal.
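Of the three interventions above, internal noise was the most potent. A minimal sketch of such a mechanism, assuming additive Gaussian noise on hidden activations applied during training only (the noise scale is an illustrative free parameter):

```python
import torch
import torch.nn as nn

class ActivationNoise(nn.Module):
    """Adds zero-mean Gaussian noise to hidden activations during
    training -- one biologically inspired way to limit a network's
    capacity to memorise arbitrary (e.g., random) input-label pairs."""
    def __init__(self, sigma: float = 1.0):
        super().__init__()
        self.sigma = sigma

    def forward(self, x):
        if self.training and self.sigma > 0:
            x = x + torch.randn_like(x) * self.sigma
        return x

# Dropped in after any hidden layer, e.g.:
# nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU(), ActivationNoise(sigma=1.0), ...)
```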
Subjects
Learning; Neural Networks, Computer; Humans; Generalization, Psychological
ABSTRACT
A universal property of visual word identification is position-invariant letter identification, such that the letter "A" is coded in the same way in CAT and ACT. This should provide a fundamental constraint on theories of word identification, and, indeed, it inspired some of the theories that Frost has criticized. I show how the spatial coding scheme of Colin Davis (2010) can, in principle, account for contrasting transposed-letter (TL) priming effects and, at the same time, position-invariant letter identification.
Subjects
Brain/physiology; Models, Neurological; Reading; Recognition, Psychology/physiology; Semantics; Humans
ABSTRACT
Humans can identify objects following various spatial transformations such as scale and viewpoint. This extends to novel objects, after a single presentation at a single pose, sometimes referred to as online invariance. CNNs have been proposed as a compelling model of human vision, but their ability to identify objects across transformations is typically tested on held-out samples of trained categories after extensive data augmentation. This paper assesses whether standard CNNs can support human-like online invariance by training models to recognize images of synthetic 3D objects that undergo several transformations: rotation, scaling, translation, brightness, contrast, and viewpoint. Through the analysis of models' internal representations, we show that standard supervised CNNs trained on transformed objects can acquire strong invariances on novel classes even when trained with as few as 50 objects taken from 10 classes. These invariances extended to a different dataset of photographs of real objects. We also show that these invariances can be acquired in a self-supervised way, by solving a same/different task. We suggest that this latter approach may be similar to how humans acquire invariances.
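With standard tooling, the training regime above corresponds to a random-augmentation pipeline over 2D renders; a sketch using torchvision (parameter ranges are illustrative, and viewpoint changes, which require re-rendering the 3D object, are not expressible as a 2D image transform):

```python
from torchvision import transforms

# Random transformations of the kinds listed above: rotation,
# translation, scaling, brightness and contrast.
train_transforms = transforms.Compose([
    transforms.RandomAffine(degrees=180,              # rotation
                            translate=(0.2, 0.2),     # translation
                            scale=(0.5, 1.5)),        # scaling
    transforms.ColorJitter(brightness=0.5, contrast=0.5),
    transforms.ToTensor(),
])
```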
Subjects
Learning; Neural Networks, Computer; Humans; Rotation
ABSTRACT
Deep Convolutional Neural Networks (DCNNs) have achieved superhuman accuracy on standard image classification benchmarks. Their success has reignited significant interest in their use as models of the primate visual system, bolstered by claims of their architectural and representational similarities. However, closer scrutiny of these models suggests that they rely on various forms of shortcut learning to achieve their impressive performance, such as using texture rather than shape information. Such superficial solutions to image recognition have been shown to make DCNNs brittle in the face of more challenging tests such as noise-perturbed or out-of-distribution images, casting doubt on their similarity to their biological counterparts. In the present work, we demonstrate that adding fixed biologically inspired filter banks, in particular banks of Gabor filters, helps to constrain the networks to avoid reliance on shortcuts, making them develop more structured internal representations and greater tolerance to noise. Importantly, they also gained around 20-35% in accuracy over standard end-to-end trained architectures when generalising to our novel out-of-distribution test image sets. We take these findings to suggest that these properties of the primate visual system should be incorporated into DCNNs to make them better able to cope with real-world vision and better capture some of the more impressive aspects of human visual perception, such as generalisation.
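A fixed Gabor front end of the kind described above can be built by filling a convolution's weights with Gabor kernels and freezing them so they are untouched by end-to-end training. A minimal sketch; the kernel parameters and single input channel are illustrative:

```python
import math
import torch
import torch.nn as nn

def gabor_kernel(size=15, sigma=3.0, theta=0.0, lambd=6.0, psi=0.0):
    """Single real Gabor kernel: a sinusoid windowed by a Gaussian."""
    half = size // 2
    ys, xs = torch.meshgrid(torch.arange(-half, half + 1).float(),
                            torch.arange(-half, half + 1).float(),
                            indexing="ij")
    x_t = xs * math.cos(theta) + ys * math.sin(theta)
    return (torch.exp(-(xs**2 + ys**2) / (2 * sigma**2))
            * torch.cos(2 * math.pi * x_t / lambd + psi))

def fixed_gabor_bank(n_orientations=8, size=15):
    """First conv layer with frozen Gabor filters at several orientations."""
    conv = nn.Conv2d(1, n_orientations, kernel_size=size,
                     padding=size // 2, bias=False)
    with torch.no_grad():
        for i in range(n_orientations):
            theta = i * math.pi / n_orientations
            conv.weight[i, 0] = gabor_kernel(size=size, theta=theta)
    conv.weight.requires_grad_(False)   # hard-wired, not trained end-to-end
    return conv
```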
Subjects
Neural Networks, Computer; Visual Perception; Animals; Generalization, Psychological; Recognition, Psychology; Vision, Ocular
ABSTRACT
There is growing interest in the role that morphological knowledge plays in literacy acquisition, but there is no research directly comparing the efficacy of different forms of morphological instruction. Here we compare two methods of teaching English morphology in the context of a memory experiment in which words were organized by affix during study (e.g., a list of words was presented that all share an affix, such as
Subjects
Education/methods; Literacy/trends; Teaching/education; Education/trends; Female; Humans; Language; Linguistics/methods; Male; Reading; Young Adult
ABSTRACT
Combinatorial generalization - the ability to understand and produce novel combinations of already familiar elements - is considered to be a core capacity of the human mind and a major challenge to neural network models. A significant body of research suggests that conventional neural networks cannot solve this problem unless they are endowed with mechanisms specifically engineered for the purpose of representing symbols. In this paper, we introduce a novel way of representing symbolic structures in connectionist terms - the vectors approach to representing symbols (VARS), which allows training standard neural architectures to encode symbolic knowledge explicitly at their output layers. In two simulations, we show that neural networks not only can learn to produce VARS representations, but in doing so they achieve combinatorial generalization in their symbolic and non-symbolic output. This adds to other recent work that has shown improved combinatorial generalization under some training conditions, and raises the question of whether specific mechanisms or training routines are needed to support symbolic processing. This article is part of the theme issue 'Towards mechanistic models of meaning composition'.
Subjects
Neural Networks, Computer; Symbolism; Computer Simulation; Humans; Learning
ABSTRACT
Deep convolutional neural networks (DCNNs) are frequently described as the best current models of human and primate vision. An obvious challenge to this claim is the existence of adversarial images that fool DCNNs but are uninterpretable to humans. However, recent research has suggested that there may be similarities in how humans and DCNNs interpret these seemingly nonsense images. We reanalysed data from a high-profile paper and conducted five experiments controlling for different ways in which these images can be generated and selected. We show human-DCNN agreement is much weaker and more variable than previously reported, and that the weak agreement is contingent on the choice of adversarial images and the design of the experiment. Indeed, we find there are well-known methods of generating images for which humans show no agreement with DCNNs. We conclude that adversarial images still pose a challenge to theorists using DCNNs as models of human vision.
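One classic recipe for generating images of this kind - gradient ascent from random noise toward high confidence in a chosen class - can be sketched as follows. This is the textbook method, not necessarily the exact procedure of the reanalysed study:

```python
import torch

def fooling_image(model, target_class, steps=200, lr=0.1,
                  shape=(1, 3, 224, 224)):
    """Gradient ascent from random noise toward high confidence in
    `target_class` -- a standard recipe for images that a DCNN
    classifies confidently but humans find uninterpretable."""
    x = torch.rand(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x.clamp(0, 1))
        loss = -logits[0, target_class]   # maximise the target-class score
        loss.backward()
        opt.step()
    return x.clamp(0, 1).detach()
```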
Subjects
Vision, Ocular/physiology; Humans; Neural Networks, Computer
ABSTRACT
When deep convolutional neural networks (CNNs) are trained "end-to-end" on raw data, some of the feature detectors they develop in their early layers resemble the representations found in early visual cortex. This result has been used to draw parallels between deep learning systems and human visual perception. In this study, we show that when CNNs are trained end-to-end they learn to classify images based on whatever feature is predictive of a category within the dataset. This can lead to bizarre results where CNNs learn idiosyncratic features such as high-frequency noise-like masks. In the extreme case, our results demonstrate image categorisation on the basis of a single pixel. Such features are extremely unlikely to play any role in human object recognition, where experiments have repeatedly shown a strong preference for shape. Through a series of empirical studies with standard high-performance CNNs, we show that these networks do not develop a shape-bias merely through regularisation methods or more ecologically plausible training regimes. These results raise doubts over the assumption that simply learning end-to-end in standard CNNs leads to the emergence of similar representations to the human visual system. In the second part of the paper, we show that CNNs are less reliant on these idiosyncratic features when we forgo end-to-end learning and introduce hard-wired Gabor filters designed to mimic early visual processing in V1.
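The single-pixel result above suggests a simple diagnostic that can be run on any trained classifier: blank one pixel at a time and count how often the predicted class flips. A hedged sketch (brute force, so only practical for small images; the helper is illustrative):

```python
import torch

def single_pixel_sensitivity(model, image):
    """Fraction of pixels whose blanking changes the predicted class --
    a crude probe for the pixel-level shortcuts described above.
    `image` is a (C, H, W) tensor."""
    model.eval()
    with torch.no_grad():
        base = model(image.unsqueeze(0)).argmax(dim=1).item()
        flips = 0
        _, h, w = image.shape
        for y in range(h):
            for x in range(w):
                probe = image.clone()
                probe[:, y, x] = 0.0   # blank a single pixel
                if model(probe.unsqueeze(0)).argmax(dim=1).item() != base:
                    flips += 1
    return flips / (h * w)
```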
Subjects
Neural Networks, Computer; Visual Perception; Humans
ABSTRACT
Various methods of measuring unit selectivity have been developed with the aim of better understanding how neural networks work. But the different measures provide divergent estimates of selectivity, and this has led to different conclusions regarding the conditions in which selective object representations are learned and the functional relevance of these representations. In an attempt to better characterize object selectivity, we undertake a comparison of various selectivity measures on a large set of units in AlexNet, including localist selectivity, precision, class-conditional mean activity selectivity (CCMAS), the human interpretation of activation maximization (AM) images, and standard signal-detection measures. We find that the different measures provide different estimates of object selectivity, with precision and CCMAS measures providing misleadingly high estimates. Indeed, the most selective units had a poor hit-rate or a high false-alarm rate (or both) in object classification, making them poor object detectors. We fail to find any units that are even remotely as selective as the 'grandmother cell' units reported in recurrent neural networks. In order to generalize these results, we compared selectivity measures on units in VGG-16 and GoogLeNet trained on the ImageNet or Places-365 datasets that have been described as 'object detectors'. Again, we find poor hit-rates and high false-alarm rates for object classification. We conclude that signal-detection measures provide a better assessment of single-unit selectivity compared to common alternative approaches, and that deep convolutional networks trained on image classification do not learn object detectors in their hidden layers.
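The signal-detection assessment above treats a unit as a detector for a given class: its activation is thresholded, and hits (target-class images above threshold) and false alarms (other images above threshold) are counted. A minimal sketch; the midpoint threshold is an illustrative choice:

```python
import numpy as np

def unit_detection_scores(activations, labels, unit, target_class,
                          threshold=None):
    """Hit rate and false-alarm rate for one hidden unit treated as a
    detector of `target_class`. `activations` is (n_images, n_units);
    the default threshold (midpoint of class means) is illustrative."""
    acts = activations[:, unit]
    is_target = labels == target_class
    if threshold is None:
        threshold = (acts[is_target].mean() + acts[~is_target].mean()) / 2
    hit_rate = float((acts[is_target] > threshold).mean())
    false_alarm_rate = float((acts[~is_target] > threshold).mean())
    return hit_rate, false_alarm_rate
```

A genuinely selective "object detector" should combine a high hit rate with a low false-alarm rate; the abstract's point is that units flagged by precision or CCMAS often fail this joint criterion.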
Subjects
Neural Networks, Computer; Humans
ABSTRACT
A fundamental claim associated with parallel distributed processing (PDP) theories of cognition is that knowledge is coded in a distributed manner in mind and brain. This approach rejects the claim that knowledge is coded in a localist fashion, with words, objects, and simple concepts (e.g., "dog") each coded by their own dedicated representation. One of the putative advantages of this approach is that the theories are biologically plausible. Indeed, advocates of the PDP approach often highlight the close parallels between distributed representations learned in connectionist models and neural coding in the brain, and often dismiss localist (grandmother cell) theories as biologically implausible. The author reviews a range of data that strongly challenge this claim and shows that localist models provide a better account of single-cell recording studies. The author also contrasts localist and alternative distributed coding schemes (sparse and coarse coding) and argues that the common rejection of grandmother cell theories in neuroscience is due to a misunderstanding about how localist models behave. The author concludes that the localist representations embedded in theories of perception and cognition are consistent with neuroscience; biology only calls into question the distributed representations often learned in PDP models.