Búsqueda | Portal de Búsqueda de la BVS España

1.

Look twice: A generalist computational model predicts return fixations across tasks and species.

Zhang, Mengmi; Armendariz, Marcelo; Xiao, Will; Rose, Olivia; Bendtz, Katarina; Livingstone, Margaret; Ponce, Carlos; Kreiman, Gabriel.

PLoS Comput Biol ; 18(11): e1010654, 2022 11.

Artículo en Inglés | MEDLINE | ID: mdl-36413523

RESUMEN

Primates constantly explore their surroundings via saccadic eye movements that bring different parts of an image into high resolution. In addition to exploring new regions in the visual field, primates also make frequent return fixations, revisiting previously foveated locations. We systematically studied a total of 44,328 return fixations out of 217,440 fixations. Return fixations were ubiquitous across different behavioral tasks, in monkeys and humans, both when subjects viewed static images and when subjects performed natural behaviors. Return fixations locations were consistent across subjects, tended to occur within short temporal offsets, and typically followed a 180-degree turn in saccadic direction. To understand the origin of return fixations, we propose a proof-of-principle, biologically-inspired and image-computable neural network model. The model combines five key modules: an image feature extractor, bottom-up saliency cues, task-relevant visual features, finite inhibition-of-return, and saccade size constraints. Even though there are no free parameters that are fine-tuned for each specific task, species, or condition, the model produces fixation sequences resembling the universal properties of return fixations. These results provide initial steps towards a mechanistic understanding of the trade-off between rapid foveal recognition and the need to scrutinize previous fixation locations.

Asunto(s)

Fijación Ocular , Movimientos Sacádicos , Animales , Humanos , Campos Visuales , Primates , Señales (Psicología)

2.

Tuned Compositional Feature Replays for Efficient Stream Learning.

Talbot, Morgan B; Zawar, Rushikesh; Badkundri, Rohil; Zhang, Mengmi; Kreiman, Gabriel.

IEEE Trans Neural Netw Learn Syst ; PP2023 Dec 25.

Artículo en Inglés | MEDLINE | ID: mdl-38145511

RESUMEN

Our brains extract durable, generalizable knowledge from transient experiences of the world. Artificial neural networks come nowhere close to this ability. When tasked with learning to classify objects by training on nonrepeating video frames in temporal order (online stream learning), models that learn well from shuffled datasets catastrophically forget old knowledge upon learning new stimuli. We propose a new continual learning algorithm, compositional replay using memory blocks (CRUMB), which mitigates forgetting by replaying feature maps reconstructed by combining generic parts. CRUMB concatenates trainable and reusable memory block vectors to compositionally reconstruct feature map tensors in convolutional neural networks (CNNs). Storing the indices of memory blocks used to reconstruct new stimuli enables memories of the stimuli to be replayed during later tasks. This reconstruction mechanism also primes the neural network to minimize catastrophic forgetting by biasing it toward attending to information about object shapes more than information about image textures and stabilizes the network during stream learning by providing a shared feature-level basis for all training examples. These properties allow CRUMB to outperform an otherwise identical algorithm that stores and replays raw images while occupying only 3.6% as much memory. We stress-tested CRUMB alongside 13 competing methods on seven challenging datasets. To address the limited number of existing online stream learning datasets, we introduce two new benchmarks by adapting existing datasets for stream learning. With only 3.7%-4.1% as much memory and 15%-43% as much runtime, CRUMB mitigates catastrophic forgetting more effectively than the state-of-the-art. Our code is available at https://github.com/MorganBDT/crumb.git.

3.

Learning to Learn: How to Continuously Teach Humans and Machines.

Singh, Parantak; Li, You; Sikarwar, Ankur; Lei, Weixian; Gao, Difei; Talbot, Morgan B; Sun, Ying; Shou, Mike Zheng; Kreiman, Gabriel; Zhang, Mengmi.

IEEE Int Conf Comput Vis Workshops ; 2023: 11674-11685, 2023 Oct.

Artículo en Inglés | MEDLINE | ID: mdl-38784111

RESUMEN

Curriculum design is a fundamental component of education. For example, when we learn mathematics at school, we build upon our knowledge of addition to learn multiplication. These and other concepts must be mastered before our first algebra lesson, which also reinforces our addition and multiplication skills. Designing a curriculum for teaching either a human or a machine shares the underlying goal of maximizing knowledge transfer from earlier to later tasks, while also minimizing forgetting of learned tasks. Prior research on curriculum design for image classification focuses on the ordering of training examples during a single offline task. Here, we investigate the effect of the order in which multiple distinct tasks are learned in a sequence. We focus on the online class-incremental continual learning setting, where algorithms or humans must learn image classes one at a time during a single pass through a dataset. We find that curriculum consistently influences learning outcomes for humans and for multiple continual machine learning algorithms across several benchmark datasets. We introduce a novel-object recognition dataset for human curriculum learning experiments and observe that curricula that are effective for humans are highly correlated with those that are effective for machines. As an initial step towards automated curriculum design for online class-incremental learning, we propose a novel algorithm, dubbed Curriculum Designer (CD), that designs and ranks curricula based on inter-class feature similarities. We find significant overlap between curricula that are empirically highly effective and those that are highly ranked by our CD. Our study establishes a framework for further research on teaching humans and machines to learn continuously using optimized curricula. Our code and data are available through this link.

4.

When Pigs Fly: Contextual Reasoning in Synthetic and Natural Scenes.

Bomatter, Philipp; Zhang, Mengmi; Karev, Dimitar; Madan, Spandan; Tseng, Claire; Kreiman, Gabriel.

IEEE Int Conf Comput Vis Workshops ; 2021: 255-264, 2021 Oct.

Artículo en Inglés | MEDLINE | ID: mdl-36051852

RESUMEN

Context is of fundamental importance to both human and machine vision; e.g., an object in the air is more likely to be an airplane than a pig. The rich notion of context incorporates several aspects including physics rules, statistical co-occurrences, and relative object sizes, among others. While previous work has focused on crowd-sourced out-of-context photographs from the web to study scene context, controlling the nature and extent of contextual violations has been a daunting task. Here we introduce a diverse, synthetic Out-of-Context Dataset (OCD) with fine-grained control over scene context. By leveraging a 3D simulation engine, we systematically control the gravity, object co-occurrences and relative sizes across 36 object categories in a virtual household environment. We conducted a series of experiments to gain insights into the impact of contextual cues on both human and machine vision using OCD. We conducted psychophysics experiments to establish a human benchmark for out-of-context recognition, and then compared it with state-of-the-art computer vision models to quantify the gap between the two. We propose a context-aware recognition transformer model, fusing object and contextual information via multi-head attention. Our model captures useful information for contextual reasoning, enabling human-level performance and better robustness in out-of-context conditions compared to baseline models across OCD and other out-of-context datasets. All source code and data are publicly available at https://github.com/kreimanlab/WhenPigsFlyContext.

5.

Visual Search Asymmetry: Deep Nets and Humans Share Similar Inherent Biases.

Gupta, Shashi Kant; Zhang, Mengmi; Wu, Chia-Chien; Wolfe, Jeremy M; Kreiman, Gabriel.

Adv Neural Inf Process Syst ; 34: 6946-6959, 2021 Dec.

Artículo en Inglés | MEDLINE | ID: mdl-36062138

RESUMEN

Visual search is a ubiquitous and often challenging daily task, exemplified by looking for the car keys at home or a friend in a crowd. An intriguing property of some classical search tasks is an asymmetry such that finding a target A among distractors B can be easier than finding B among A. To elucidate the mechanisms responsible for asymmetry in visual search, we propose a computational model that takes a target and a search image as inputs and produces a sequence of eye movements until the target is found. The model integrates eccentricity-dependent visual recognition with target-dependent top-down cues. We compared the model against human behavior in six paradigmatic search tasks that show asymmetry in humans. Without prior exposure to the stimuli or task-specific training, the model provides a plausible mechanism for search asymmetry. We hypothesized that the polarity of search asymmetry arises from experience with the natural environment. We tested this hypothesis by training the model on augmented versions of ImageNet where the biases of natural images were either removed or reversed. The polarity of search asymmetry disappeared or was altered depending on the training protocol. This study highlights how classical perceptual properties can emerge in neural network models, without the need for task-specific training, but rather as a consequence of the statistical properties of the developmental diet fed to the model. All source code and data are publicly available at https://github.com/kreimanlab/VisualSearchAsymmetry.

6.

Putting visual object recognition in context.

Zhang, Mengmi; Tseng, Claire; Kreiman, Gabriel.

Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit ; 2020: 12982-12991, 2020 Jun.

Artículo en Inglés | MEDLINE | ID: mdl-34566393

RESUMEN

Context plays an important role in visual recognition. Recent studies have shown that visual recognition networks can be fooled by placing objects in inconsistent contexts (e.g. a cow in the ocean). To understand and model the role of contextual information in visual recognition, we systematically and quantitatively investigated ten critical properties of where, when, and how context modulates recognition including amount of context, context and object resolution, geometrical structure of context, context congruence, time required to incorporate contextual information, and temporal dynamics of contextual modulation. The tasks involve recognizing a target object surrounded with context in a natural image. As an essential benchmark, we first describe a series of psychophysics experiments, where we alter one aspect of context at a time, and quantify human recognition accuracy. To computationally assess performance on the same tasks, we propose a biologically inspired context aware object recognition model consisting of a two-stream architecture. The model processes visual information at the fovea and periphery in parallel, dynamically incorporates both object and contextual information, and sequentially reasons about the class label for the target object. Across a wide range of behavioral tasks, the model approximates human level performance without retraining for each task, captures the dependence of context enhancement on image properties, and provides initial steps towards integrating scene and object information for visual recognition.

7.

Anticipating Where People will Look Using Adversarial Networks.

Zhang, Mengmi; Ma, Keng Teck; Lim, Joo Hwee; Zhao, Qi; Feng, Jiashi.

IEEE Trans Pattern Anal Mach Intell ; 41(8): 1783-1796, 2019 08.

Artículo en Inglés | MEDLINE | ID: mdl-30273143

RESUMEN

We introduce a new problem of gaze anticipation on future frames which extends the conventional gaze prediction problem to go beyond current frames. To solve this problem, we propose a new generative adversarial network based model, Deep Future Gaze (DFG), encompassing two pathways: DFG-P is to anticipate gaze prior maps conditioned on the input frame which provides task influences; DFG-G is to learn to model both semantic and motion information in future frame generation. DFG-P and DFG-G are then fused to anticipate future gazes. DFG-G consists of two networks: a generator and a discriminator. The generator uses a two-stream spatial-temporal convolution architecture (3D-CNN) for explicitly untangling the foreground and background to generate future frames. It then attaches another 3D-CNN for gaze anticipation based on these synthetic frames. The discriminator plays against the generator by distinguishing the synthetic frames of the generator from the real frames. Experimental results on the publicly available egocentric and third person video datasets show that DFG significantly outperforms all competitive baselines. We also demonstrate that DFG achieves better performance of gaze prediction on current frames in egocentric and third person videos than state-of-the-art methods.

8.

Finding any Waldo with zero-shot invariant and efficient visual search.

Zhang, Mengmi; Feng, Jiashi; Ma, Keng Teck; Lim, Joo Hwee; Zhao, Qi; Kreiman, Gabriel.

Nat Commun ; 9(1): 3730, 2018 09 13.

Artículo en Inglés | MEDLINE | ID: mdl-30213937

RESUMEN

Searching for a target object in a cluttered scene constitutes a fundamental challenge in daily vision. Visual search must be selective enough to discriminate the target from distractors, invariant to changes in the appearance of the target, efficient to avoid exhaustive exploration of the image, and must generalize to locate novel target objects with zero-shot training. Previous work on visual search has focused on searching for perfect matches of a target after extensive category-specific training. Here, we show for the first time that humans can efficiently and invariantly search for natural objects in complex scenes. To gain insight into the mechanisms that guide visual search, we propose a biologically inspired computational model that can locate targets without exhaustive sampling and which can generalize to novel objects. The model provides an approximation to the mechanisms integrating bottom-up and top-down signals during search in natural scenes.

Asunto(s)

Atención , Reconocimiento Visual de Modelos , Visión Ocular , Percepción Visual/fisiología , Adulto , Simulación por Computador , Señales (Psicología) , Femenino , Humanos , Masculino , Psicofísica , Tiempo de Reacción , Factores de Tiempo , Adulto Joven

9.

Beauty is in the eye of the machine.

Zhang, Mengmi; Kreiman, Gabriel.

Nat Hum Behav ; 5(6): 675-676, 2021 06.

Artículo en Inglés | MEDLINE | ID: mdl-34017096

Asunto(s)

Belleza , Humanos

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

RESUMEN

RESUMEN

RESUMEN

RESUMEN

RESUMEN

Asunto(s)

Asunto(s)

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA