1.
Caries Res ; 56(2): 129-137, 2022.
Article in English | MEDLINE | ID: mdl-35398845

ABSTRACT

Visual attention is a significant gateway to a child's mind, and looking is one of the first behaviors young children develop. Untreated caries and the resulting poor dental aesthetics can have adverse emotional and social impacts on children's oral health-related quality of life through their detrimental effects on self-esteem and self-concept. We therefore explored preschool children's eye movement patterns and visual attention to images with and without dental caries via eye movement analysis with hidden Markov models (EMHMM). A convenience sample of 157 preschool children was calibrated to the eye tracker (Tobii Nano Pro) to ensure standardization. Each participant then viewed the same standardized pictures with and without dental caries while the device recorded their eye movements. Based on the sequence of viewed regions of interest (ROIs), a transition matrix was built in which a participant's previously viewed ROI informed the ROI they considered next. Each individual's HMM was then estimated from their eye movement data using a variational Bayesian approach that determined the optimal number of ROIs automatically. This data-driven approach yielded the most representative eye movement patterns for the visual task. Preschool children exhibited two significantly different eye movement patterns: distributed (78%) and selective (21%). In the distributed pattern, children switched between images with roughly equal probabilities, whereas in the selective pattern they were more likely to keep looking at the same ROI than to switch to the other. Nevertheless, all children were equally likely to fixate first on the right or left image, and all noticed teeth. The findings reveal that most preschool children did not have an attentional bias toward images with or without dental caries, and only a few children selectively fixated on images with dental caries. Selective eye movement patterns may therefore strongly predict preschool children's sustained visual attention to dental caries. Nevertheless, future studies are essential to fully understand the developmental origins of differences in visual attention to common oral health presentations in children. Finally, EMHMM is appropriate for assessing inter-individual differences in children's visual attention.
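
A minimal sketch of the transition-matrix step described above, assuming fixations have already been mapped to ROI labels (hypothetical data; the full EMHMM toolbox additionally estimates the ROIs and HMM parameters with variational Bayes):

```python
import numpy as np

def transition_matrix(roi_sequence, n_rois):
    """Estimate P(next ROI | current ROI) from one participant's fixations."""
    counts = np.zeros((n_rois, n_rois))
    for prev, nxt in zip(roi_sequence[:-1], roi_sequence[1:]):
        counts[prev, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no visits fall back to a uniform distribution.
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1), 1.0 / n_rois)

# Hypothetical two-ROI task: 0 = left image, 1 = right image.
fixations = [0, 0, 1, 0, 0, 0, 1, 1]
print(transition_matrix(fixations, n_rois=2))
```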


Subject(s)
Dental Caries , Bayes Theorem , Child, Preschool , Dental Caries/diagnostic imaging , Eye-Tracking Technology , Humans , Oral Health , Quality of Life
2.
Dent Traumatol ; 38(5): 410-416, 2022 Oct.
Article in English | MEDLINE | ID: mdl-35460595

ABSTRACT

BACKGROUND/AIM: Traumatic dental injuries (TDIs) in the primary dentition may result in tooth discolouration and fractures. The aim of this child-centred study was to explore the differences between preschool children's eye movement patterns and visual attention to typical outcomes following TDIs to primary teeth. MATERIALS AND METHODS: An eye-tracker recorded 155 healthy preschool children's eye movements as they viewed clinical images of healthy teeth, tooth fractures and discolourations. Visual search patterns were analysed using the eye movement analysis with hidden Markov models (EMHMM) approach, together with preference for the various regions of interest (ROIs). RESULTS: Two different eye movement patterns (distributed and selective) were identified (p < .05). Children with the distributed pattern shifted their fixations between the presented images, while those with the selective pattern remained focused on the image they saw first. CONCLUSIONS: Preschool children noticed teeth. However, most of them did not have an attentional bias, implying that they did not interpret these TDI outcomes negatively. Only a few children avoided looking at images with TDIs, indicating a potential negative impact. The EMHMM approach is appropriate for assessing inter-individual differences in children's visual attention to TDI outcomes.


Subject(s)
Tooth Fractures , Tooth Injuries , Child, Preschool , Eye-Tracking Technology , Humans , Tooth, Deciduous
3.
Behav Res Methods ; 53(6): 2473-2486, 2021 12.
Article in English | MEDLINE | ID: mdl-33929699

ABSTRACT

The eye movement analysis with hidden Markov models (EMHMM) method provides quantitative measures of individual differences in eye-movement patterns. However, it is limited to tasks where stimuli share the same feature layout (e.g., faces). Here we propose combining EMHMM with the data-mining technique of co-clustering to discover participant groups with consistent eye-movement patterns across stimuli in tasks whose stimuli have different feature layouts. By applying this method to eye movements in scene perception, we discovered explorative (switching between foreground and background information, or between different regions of interest) and focused (mainly looking at the foreground, with less switching) eye-movement patterns among Asian participants. Higher similarity to the explorative pattern predicted better foreground-object recognition performance, whereas higher similarity to the focused pattern was associated with better feature integration in the flanker task. These results have important implications for using eye tracking as a window into individual differences in cognitive abilities and styles. Thus, EMHMM with co-clustering provides quantitative assessments of eye-movement patterns across stimuli and tasks. It can be applied to many other real-life visual tasks, making a significant impact on the use of eye tracking to study cognitive behavior across disciplines.
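
A sketch of the co-clustering idea, assuming we already have a participants-by-stimuli matrix of eye-movement-pattern scores (hypothetical data; the paper combines co-clustering with EMHMM rather than using scikit-learn's SpectralCoclustering, which stands in here):

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
# scores[i, j]: e.g., how "explorative" participant i's scanpath was on stimulus j.
scores = rng.random((20, 12))

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(scores)
print("participant groups:", model.row_labels_)   # e.g., explorative vs. focused
print("stimulus groups:   ", model.column_labels_)
```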


Subject(s)
Eye Movements , Individuality , Asian People , Cluster Analysis , Humans , Visual Perception
4.
Cogn Emot ; 34(8): 1704-1710, 2020 12.
Article in English | MEDLINE | ID: mdl-32552552

ABSTRACT

Theoretical models propose that attentional biases might account for the maintenance of social anxiety symptoms. However, previous eye-tracking studies have yielded mixed results. One explanation is that existing studies quantify eye movements using arbitrary, experimenter-defined criteria such as time segments and regions of interest that do not capture the dynamic nature of overt visual attention. The current study adopted the eye movement analysis with hidden Markov models (EMHMM) approach, a machine-learning, data-driven approach that can cluster people's eye movements into different strategy groups. Sixty participants high and low in self-reported social anxiety symptoms viewed angry and neutral faces in a free-viewing task while their eye movements were recorded. EMHMM analyses revealed novel associations between eye-movement patterns and social anxiety symptoms that were not evident with standard analytical approaches. Participants who adopted the same face-viewing strategy when viewing both angry and neutral faces showed higher social anxiety symptoms than those who transitioned between strategies when viewing angry versus neutral faces. EMHMM can offer novel insights into psychopathology-related attention processes.
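
A sketch of the strategy-group assignment underlying this kind of analysis: score a scanpath under two group-level HMMs and assign the likelier group. This uses hmmlearn's CategoricalHMM as a stand-in for the EMHMM toolbox's variational Bayesian estimation; the ROI labels and sequences are hypothetical:

```python
import numpy as np
from hmmlearn import hmm

def fit_group_hmm(sequences, n_rois, n_states=2, seed=0):
    X = np.concatenate(sequences).reshape(-1, 1)
    lengths = [len(s) for s in sequences]
    model = hmm.CategoricalHMM(n_components=n_states, n_features=n_rois,
                               random_state=seed)
    return model.fit(X, lengths)

# Hypothetical ROI-label scanpaths (0 = eyes, 1 = nose, 2 = mouth).
group_a = [[0, 0, 1, 2, 1], [0, 1, 1, 2, 2]]   # one face-viewing strategy
group_b = [[2, 2, 2, 1, 2], [2, 1, 2, 2, 2]]   # another strategy
hmm_a = fit_group_hmm(group_a, n_rois=3)
hmm_b = fit_group_hmm(group_b, n_rois=3)

new_scan = np.array([[0], [1], [1], [2], [1]])
# Assign a new participant to the group whose HMM explains their scanpath better.
print("closer to group", "A" if hmm_a.score(new_scan) > hmm_b.score(new_scan) else "B")
```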


Subject(s)
Anxiety/psychology , Attentional Bias/physiology , Emotions/physiology , Eye Movements/physiology , Facial Expression , Adult , Anxiety/physiopathology , Female , Hong Kong , Humans , Male , Markov Chains , Students/psychology , Students/statistics & numerical data , Young Adult
5.
Behav Res Methods ; 52(3): 1026-1043, 2020 06.
Article in English | MEDLINE | ID: mdl-31712999

ABSTRACT

Here we propose the eye movement analysis with switching hidden Markov model (EMSHMM) approach to analyzing eye movement data in cognitive tasks involving cognitive state changes. We used a switching hidden Markov model (SHMM) to capture a participant's cognitive state transitions during the task, with eye movement patterns during each cognitive state summarized by a regular HMM. We applied EMSHMM to a face preference decision-making task with two pre-assumed cognitive states (exploration and preference-biased periods), and discovered two common eye movement patterns by clustering the cognitive state transitions. One pattern showed both a later transition from the exploration to the preference-biased cognitive state and a stronger tendency to look at the preferred stimulus at the end, and was associated with higher decision-inference accuracy at the end; the other pattern entered the preference-biased cognitive state earlier, leading to above-chance inference accuracy earlier in a trial but lower inference accuracy at the end. This finding was not revealed by any other method. Compared with our previous HMM method, which assumes no cognitive state change (i.e., EMHMM), EMSHMM captured eye movement behavior in the task better, resulting in higher decision-inference accuracy. Thus, EMSHMM reveals and provides quantitative measures of individual differences in cognitive behavior/style, making a significant impact on the use of eye tracking to study cognitive behavior across disciplines.
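
A much-simplified sketch of recovering a cognitive-state transition. The actual EMSHMM nests an HMM of eye-movement patterns inside each high-level cognitive state; here a single two-state GaussianHMM over a hypothetical per-time-bin gaze-bias feature stands in for that machinery:

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(1)
# Hypothetical per-time-bin gaze bias toward the eventually chosen face:
# near 0.5 while exploring, drifting toward 1.0 once preference-biased.
bias = np.concatenate([rng.normal(0.5, 0.05, 40), rng.normal(0.85, 0.05, 30)])

model = hmm.GaussianHMM(n_components=2, random_state=0).fit(bias.reshape(-1, 1))
states = model.predict(bias.reshape(-1, 1))
switch = int(np.argmax(states != states[0]))  # first bin decoded as the other state
print("estimated exploration -> preference transition at bin", switch)
```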


Subject(s)
Eye Movements , Face , Humans , Individuality , Markov Chains , Probability
6.
Behav Res Methods ; 50(1): 362-379, 2018 02.
Article in English | MEDLINE | ID: mdl-28409487

ABSTRACT

How people look at visual information reveals fundamental information about them: their interests and their states of mind. Previous studies showed that the scanpath, i.e., the sequence of eye movements made by an observer exploring a visual stimulus, can be used to infer observer-related (e.g., the task at hand) and stimulus-related (e.g., image semantic category) information. However, eye movements are complex signals, and many of these studies rely on limited gaze descriptors and bespoke datasets. Here, we provide a turnkey method for scanpath modeling and classification. This method relies on variational hidden Markov models (HMMs) and discriminant analysis (DA). HMMs encapsulate the dynamic and individualistic dimensions of gaze behavior, allowing DA to capture systematic patterns diagnostic of a given class of observers and/or stimuli. We test our approach on two very different datasets. First, we use fixations recorded while viewing 800 static natural-scene images, and infer an observer-related characteristic: the task at hand. We achieve an average correct classification rate of 55.9% (chance = 33%), and show that correct classification rates correlate positively with the number of salient regions present in the stimuli. Second, we use eye positions recorded while viewing 15 conversational videos, and infer a stimulus-related characteristic: the presence or absence of the original soundtrack. We achieve an average correct classification rate of 81.2% (chance = 50%). HMMs make it possible to integrate bottom-up, top-down, and oculomotor influences into a single model of gaze behavior. This synergistic approach between behavior and machine learning will open new avenues for simple quantification of gazing behavior. We release SMAC with HMM, a Matlab toolbox freely available to the community under an open-source license agreement.
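
A sketch of the final stage of such an HMM-plus-DA pipeline: represent each scanpath by its log-likelihoods under per-class HMMs and classify with linear discriminant analysis. The feature matrix below is synthetic and stands in for fitted-HMM scores (the released toolbox is in Matlab; this Python analogue is illustrative only):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
# loglik[i] = scanpath i's log-likelihood under the HMM of each of 3 tasks.
loglik = rng.normal(size=(90, 3)) + np.repeat(np.eye(3) * 2.0, 30, axis=0)
task = np.repeat([0, 1, 2], 30)   # the task each scanpath actually came from

lda = LinearDiscriminantAnalysis().fit(loglik, task)
print("training accuracy:", lda.score(loglik, task))  # chance would be ~0.33
```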


Subject(s)
Eye Movements , Machine Learning , Markov Chains , Photic Stimulation/methods , Fixation, Ocular , Humans , Individuality , Probability , Task Performance and Analysis
7.
J Vis ; 14(11)2014 Sep 16.
Article in English | MEDLINE | ID: mdl-25228627

ABSTRACT

We use a hidden Markov model (HMM)-based approach to analyze eye movement data in face recognition. HMMs are statistical models specialized in handling time-series data. We conducted a face recognition task with Asian participants and modeled each participant's eye movement pattern with an HMM, which summarized the participant's scan paths in face recognition with both regions of interest and the transition probabilities among them. By clustering these HMMs, we showed that participants' eye movements could be categorized into holistic or analytic patterns, demonstrating significant individual differences even within the same culture. Participants with the analytic pattern had longer response times but did not differ significantly in recognition accuracy from those with the holistic pattern. We also found that correct and incorrect recognitions were associated with distinctive eye movement patterns; the difference between the two patterns lies in the transitions rather than the locations of the fixations alone.


Subject(s)
Eye Movements/physiology , Face/physiology , Pattern Recognition, Visual/physiology , Recognition, Psychology/physiology , Adolescent , Female , Humans , Male , Markov Chains , Models, Statistical , Probability , Young Adult
8.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 2882-2899, 2024 May.
Article in English | MEDLINE | ID: mdl-37995158

ABSTRACT

Typical approaches that learn crowd density maps are limited to extracting supervisory information from the loosely organized spatial information in crowd dot/density maps. This paper tackles this challenge by performing the supervision in the frequency domain. More specifically, we devise a new loss function for crowd analysis called the generalized characteristic function loss (GCFL). This loss carries out two steps: 1) transforming the spatial information in density or dot maps to the frequency domain; 2) calculating a loss value between their frequency contents. For step 1, we establish a series of theoretical foundations by extending the definition of the characteristic function from probability distributions to density maps, and by proving some vital properties of the extended characteristic function. After taking the characteristic function of the density map, its information in the frequency domain is well organized and hierarchically distributed, whereas in the spatial domain it is loosely organized and dispersed everywhere. In step 2, we design a loss function that fits this organization of information in the frequency domain, allowing the well-organized frequency information to be exploited for the supervision of crowd analysis tasks. The loss function can be adapted to various crowd analysis tasks through the specification of its window functions. In this paper, we demonstrate its power in three tasks: crowd counting, crowd localization, and noisy crowd counting. We show the advantages of our GCFL over other SOTA losses, and its competitiveness with other SOTA methods, through theoretical analysis and empirical results on benchmark datasets. Our code is available at https://github.com/wbshu/Crowd_Counting_in_the_Frequency_Domain.
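
A sketch of the frequency-domain supervision idea: compare characteristic functions of a predicted density map and a ground-truth dot map under a frequency window. The frequency grid and window below are illustrative choices, not the paper's GCFL specification:

```python
import numpy as np

def char_fn(density, freqs):
    """phi(w) = sum_x density(x) * exp(i w . x), x ranging over pixel coords."""
    H, W = density.shape
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1)     # (H*W, 2)
    return density.ravel() @ np.exp(1j * coords @ freqs.T)  # (n_freqs,)

def gcfl_like_loss(pred, dots, freqs, window):
    """Windowed L1 distance between the two characteristic functions."""
    return np.sum(window * np.abs(char_fn(pred, freqs) - char_fn(dots, freqs)))

freqs = np.stack(np.meshgrid(np.linspace(-0.3, 0.3, 9),
                             np.linspace(-0.3, 0.3, 9)), -1).reshape(-1, 2)
window = np.exp(-np.linalg.norm(freqs, axis=1))             # emphasize low frequencies
pred = np.random.default_rng(0).random((32, 32)); pred /= pred.sum()
dots = np.zeros((32, 32)); dots[10, 12] = dots[20, 5] = 1.0
print("loss:", gcfl_like_loss(pred, dots, freqs, window))
```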

9.
IEEE Trans Pattern Anal Mach Intell ; 46(9): 5967-5985, 2024 Sep.
Article in English | MEDLINE | ID: mdl-38517727

ABSTRACT

We propose gradient-weighted Object Detector Activation Maps (ODAM), a visual explanation technique for interpreting the predictions of object detectors. Utilizing the gradients of detector targets flowing into the intermediate feature maps, ODAM produces heat maps that show the influence of regions on the detector's decision for each predicted attribute. Compared to previous work on classification activation maps (CAM), ODAM generates instance-specific explanations rather than class-specific ones. We show that ODAM is applicable to one-stage, two-stage, and transformer-based detectors with different types of backbones and heads, and produces higher-quality visual explanations than the state of the art in terms of both effectiveness and efficiency. We discuss two explanation tasks for object detection: 1) object specification: what is the important region for the prediction? 2) object discrimination: which object is detected? Aiming at these two aspects, we present a detailed analysis of the visual explanations of detectors and carry out extensive experiments to validate the effectiveness of the proposed ODAM. Furthermore, we investigate users' trust in the explanation maps, how well the visual explanations of object detectors agree with human explanations (as measured through human eye gaze), and whether this agreement is related to user trust. Finally, we propose two applications based on these two abilities of ODAM: ODAM-KD and ODAM-NMS. ODAM-KD utilizes the object specification of ODAM to generate top-down attention for key predictions and to guide the knowledge distillation of object detection. ODAM-NMS considers the location of the model's explanation for each prediction to distinguish duplicate detected objects. A training scheme, ODAM-Train, is proposed to improve the quality of object discrimination and to help with ODAM-NMS.
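
A Grad-CAM-style sketch of the core computation as described in the abstract: weight intermediate activations element-wise by the gradient of one prediction's score, so the explanation is specific to that detected instance. The feature map and detector head below are placeholders, not ODAM's actual architecture:

```python
import torch
import torch.nn.functional as F

feat = torch.randn(1, 64, 32, 32, requires_grad=True)  # intermediate feature map
head = torch.nn.Conv2d(64, 1, 1)                        # stand-in detector head
score = head(feat).amax()                               # score of one "detection"

grads, = torch.autograd.grad(score, feat)               # d(score)/d(features)
# Element-wise gradient weighting, then sum over channels (instance-specific).
heat = F.relu((grads * feat).sum(dim=1, keepdim=True))
heat = F.interpolate(heat, size=(256, 256), mode="bilinear", align_corners=False)
print(heat.shape)  # (1, 1, 256, 256): a heat map for this one prediction
```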

10.
Article in English | MEDLINE | ID: mdl-38809736

ABSTRACT

Graph neural networks (GNNs) are widely used for analyzing graph-structured data and solving graph-related tasks due to their powerful expressiveness. However, existing off-the-shelf GNN-based models usually consist of no more than three layers. Deeper GNNs usually suffer from severe performance degradation due to several issues, including the infamous "over-smoothing" issue, which restricts the further development of GNNs. In this article, we investigate the over-smoothing issue in deep GNNs. We discover that over-smoothing not only results in indistinguishable embeddings of graph nodes but also alters and even corrupts their semantic structures, a phenomenon we dub semantic over-smoothing. Existing techniques, e.g., graph normalization, aim at handling the former concern but neglect the importance of preserving the semantic structures in the spatial domain, which hinders further improvement of model performance. To alleviate the concern, we propose a cluster-keeping sparse aggregation strategy to preserve the semantic structure of embeddings in deep GNNs (especially for spatial GNNs). Specifically, our strategy heuristically redistributes the extent of aggregation across layers for all nodes, instead of aggregating equally at every layer, so that deep layers aggregate concise yet meaningful information. Without any bells and whistles, it can be easily implemented as a plug-and-play structure of GNNs via weighted residual connections, as sketched below. Finally, we analyze the over-smoothing issue in GNNs with weighted residual structures and conduct experiments demonstrating performance comparable to the state of the art.
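
A sketch of a weighted residual connection of the kind the abstract describes, with a per-layer blending weight that makes deeper layers aggregate less. The 1/(layer+1) schedule is an illustrative assumption, not the paper's redistribution rule:

```python
import torch

def weighted_residual_layer(H, A_norm, W, alpha):
    """Blend GCN-style aggregation with the incoming embeddings."""
    aggregated = torch.relu(A_norm @ H @ W)
    return alpha * aggregated + (1 - alpha) * H

n, d = 100, 16
H = torch.randn(n, d)
A_norm = torch.softmax(torch.randn(n, n), dim=1)  # stand-in for D^-1/2 A D^-1/2
for layer_idx in range(16):                       # 16 layers deep
    W = torch.nn.init.xavier_uniform_(torch.empty(d, d))
    alpha = 1.0 / (layer_idx + 1)                 # deeper layers aggregate less
    H = weighted_residual_layer(H, A_norm, W, alpha)
print(H.shape)
```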

11.
Neural Netw ; 177: 106392, 2024 Sep.
Article in English | MEDLINE | ID: mdl-38788290

ABSTRACT

Explainable artificial intelligence (XAI) has been increasingly investigated to enhance the transparency of black-box artificial intelligence models, promoting better user understanding and trust. Developing an XAI that is faithful to models and plausible to users is both a necessity and a challenge. This work examines whether embedding human attention knowledge into saliency-based XAI methods for computer vision models could enhance their plausibility and faithfulness. Two novel XAI methods for object detection models, namely FullGrad-CAM and FullGrad-CAM++, were first developed to generate object-specific explanations by extending the current gradient-based XAI methods for image classification models. Using human attention as the objective plausibility measure, these methods achieve higher explanation plausibility. Interestingly, all current XAI methods when applied to object detection models generally produce saliency maps that are less faithful to the model than human attention maps from the same object detection task. Accordingly, human attention-guided XAI (HAG-XAI) was proposed to learn from human attention how to best combine explanatory information from the models to enhance explanation plausibility by using trainable activation functions and smoothing kernels to maximize the similarity between XAI saliency map and human attention map. The proposed XAI methods were evaluated on widely used BDD-100K, MS-COCO, and ImageNet datasets and compared with typical gradient-based and perturbation-based XAI methods. Results suggest that HAG-XAI enhanced explanation plausibility and user trust at the expense of faithfulness for image classification models, and it enhanced plausibility, faithfulness, and user trust simultaneously and outperformed existing state-of-the-art XAI methods for object detection models.
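
A sketch of the HAG-XAI fitting idea: pass a model saliency map through a trainable smoothing kernel and activation, and optimize them to maximize similarity with a human attention map. The data, kernel size, and Pearson objective are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

saliency = torch.rand(1, 1, 64, 64)   # saliency map from some XAI method
human = torch.rand(1, 1, 64, 64)      # human attention map for the same image

kernel = torch.randn(1, 1, 9, 9, requires_grad=True)  # trainable smoothing kernel
scale = torch.ones(1, requires_grad=True)             # trainable activation slope
opt = torch.optim.Adam([kernel, scale], lr=0.05)

def pearson(a, b):
    a, b = a.flatten() - a.mean(), b.flatten() - b.mean()
    return (a @ b) / (a.norm() * b.norm() + 1e-8)

for _ in range(200):
    smoothed = torch.sigmoid(scale * F.conv2d(saliency, kernel, padding=4))
    loss = -pearson(smoothed, human)   # maximize similarity (plausibility)
    opt.zero_grad(); loss.backward(); opt.step()
print("final correlation:", -loss.item())
```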


Subject(s)
Artificial Intelligence , Attention , Humans , Attention/physiology , Neural Networks, Computer
12.
IEEE Trans Neural Netw Learn Syst ; 34(12): 10653-10667, 2023 Dec.
Article in English | MEDLINE | ID: mdl-35576413

ABSTRACT

Multicamera surveillance has been an active research topic for understanding and modeling scenes. Compared to a single camera, multiple cameras provide a larger field of view and more object cues, with related applications including multiview counting, multiview tracking, 3-D pose estimation, and 3-D reconstruction. It is usually assumed that the cameras are all temporally synchronized when designing models for these multicamera-based tasks. However, this assumption is not always valid, especially for multicamera systems with network transmission delay and low frame rates due to limited network bandwidth, resulting in desynchronization of the captured frames across cameras. To handle unsynchronized multicameras, in this article we propose a synchronization model that works in conjunction with existing deep neural network (DNN)-based multiview models, thus avoiding a redesign of the whole model. We consider two variants of the model, based on where in the pipeline the synchronization occurs: scene-level synchronization and camera-level synchronization. The view synchronization step and the task-specific view fusion and prediction step are unified in the same framework and trained in an end-to-end fashion. Our view synchronization models are applied to different DNN-based multicamera vision tasks under the unsynchronized setting, including multiview counting and 3-D pose estimation, and achieve good performance compared to baselines.

13.
Br J Psychol ; 114 Suppl 1: 17-20, 2023 May.
Article in English | MEDLINE | ID: mdl-36951761

ABSTRACT

Multiple factors have been proposed to contribute to the other-race effect in face recognition, including perceptual expertise and social-cognitive accounts. Here, we propose to understand the effect and its contributing factors from the perspective of learning mechanisms that involve the joint learning of visual attention strategies and internal representations for faces, which can be modulated by the quality of contact with other-race individuals, including emotional and motivational factors. Computational simulations of this process will enhance our understanding of interactions among factors and help resolve inconsistent results in the literature. In particular, since learning is driven by task demands, visual attention effects observed in different face-processing tasks, such as passive viewing or recognition, are likely to be task-specific (although they may be associated) and should be examined and compared separately. When examining visual attention strategies, the use of more data-driven and comprehensive eye movement measures, taking both the spatial-temporal pattern and the consistency of eye movements into account, can lead to novel discoveries in other-race face processing. The proposed framework and analysis methods may be applied to other tasks of real-life significance, such as facial emotion recognition, further enhancing our understanding of the relationship between learning and visual cognition.


Subject(s)
Pattern Recognition, Visual , Racial Groups , Humans , Racial Groups/psychology , Learning , Recognition, Psychology , Eye Movements
14.
IEEE Trans Pattern Anal Mach Intell ; 45(12): 15065-15080, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37506001

ABSTRACT

Point-wise supervision is widely adopted in computer vision tasks such as crowd counting and human pose estimation. In practice, noise in the point annotations may significantly affect the performance and robustness of an algorithm. In this paper, we investigate the effect of annotation noise in point-wise supervision and propose a series of robust loss functions for different tasks. In particular, point annotation noise includes spatial-shift noise, missing-point noise, and duplicate-point noise. Spatial-shift noise is the most common and exists in crowd counting, pose estimation, visual tracking, etc., while missing-point and duplicate-point noise usually appear in dense annotations, such as those for crowd counting. We first address shift noise by modeling the real locations as random variables and the annotated points as noisy observations. The probability density function of the intermediate representation (a smooth heat map generated from dot annotations) is derived, and the negative log-likelihood is used as the loss function to naturally model the shift uncertainty in the intermediate representation. Missing and duplicate noise are further modeled empirically, under the assumption that such noise appears in high-density regions with high probability. We apply the method to crowd counting, human pose estimation, and visual tracking, propose robust loss functions for those tasks, and achieve superior performance and robustness on widely used datasets.
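
A sketch of the spatial-shift modeling idea: treat each annotated point as a Gaussian-perturbed observation of the true location and minimize the negative log-likelihood. The paper derives this for the smoothed heat-map representation; this reduced form over raw coordinates is a simplification:

```python
import torch

def shift_noise_nll(pred_points, annot_points, sigma=4.0):
    """Gaussian NLL of annotated points given predicted locations (pixels)."""
    sq_dist = ((pred_points - annot_points) ** 2).sum(dim=1)
    const = torch.log(torch.tensor(2 * torch.pi * sigma ** 2))
    return (sq_dist / (2 * sigma ** 2) + const).mean()

pred = torch.tensor([[10.0, 12.0], [40.0, 7.5]], requires_grad=True)
annot = torch.tensor([[12.0, 13.0], [38.0, 9.0]])  # slightly shifted annotations
loss = shift_noise_nll(pred, annot)
loss.backward()   # gradients stay well-behaved under small annotation shifts
print(loss.item())
```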

15.
IEEE Trans Pattern Anal Mach Intell ; 45(2): 2088-2103, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35294345

ABSTRACT

Recent image captioning models achieve impressive results on popular metrics such as BLEU, CIDEr, and SPICE. However, focusing on the most popular metrics, which only consider the overlap between generated captions and human annotations, encourages common words and phrases and yields captions that lack distinctiveness, i.e., many similar images get the same caption. In this paper, we aim to improve the distinctiveness of image captions by comparing and reweighting against a set of similar images. First, we propose a distinctiveness metric, between-set CIDEr (CIDErBtw), to evaluate the distinctiveness of a caption with respect to those of similar images. Our metric reveals that the human annotations of each image in the MSCOCO dataset are not equivalent in distinctiveness; however, previous works normally treat the human annotations equally during training, which could be a reason for generating less distinctive captions. In contrast, we reweight each ground-truth caption according to its distinctiveness during training. We further integrate a long-tailed weighting strategy to highlight the rare words that carry more information, and sample captions from the similar image set as negative examples to encourage the generated sentence to be unique. Finally, extensive experiments show that our proposed approach significantly improves both distinctiveness (as measured by CIDErBtw and retrieval metrics) and accuracy (e.g., as measured by CIDEr) for a wide variety of image captioning baselines. These results are further confirmed through a user study.
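
A sketch of the caption-reweighting idea: ground-truth captions that are less similar to the captions of retrieved similar images (i.e., more distinctive) get larger training weights. The exponential weighting is an illustrative choice, and the similarity values stand in for CIDErBtw scores:

```python
import numpy as np

def distinctiveness_weights(ciderbtw, temperature=1.0):
    """ciderbtw[i]: mean similarity of ground-truth caption i to the captions
    of retrieved similar images; lower similarity = more distinctive."""
    w = np.exp(-np.asarray(ciderbtw) / temperature)
    return w * len(w) / w.sum()   # normalize so the mean weight is 1

# Hypothetical CIDErBtw values for five ground-truth captions of one image.
print(distinctiveness_weights([0.9, 0.7, 0.4, 0.2, 0.8]))
# The most distinctive caption (0.2) receives the largest training weight.
```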

16.
IEEE Trans Neural Netw Learn Syst ; 34(3): 1537-1551, 2023 Mar.
Article in English | MEDLINE | ID: mdl-34464269

ABSTRACT

The hidden Markov model (HMM) is a broadly applied generative model for representing time-series data, and clustering HMMs has attracted increasing interest from machine learning researchers. However, the number of clusters (K) and the number of hidden states (S) for cluster centers remain difficult to determine. In this article, we propose a novel HMM-based clustering algorithm, the variational Bayesian hierarchical EM algorithm, which clusters HMMs through their densities and priors and simultaneously learns posteriors for the novel HMM cluster centers that compactly represent the structure of each cluster. The numbers K and S are determined automatically in two ways. First, we place a prior on the pair (K, S) and approximate their posterior probabilities, from which the values with the maximum posterior are selected. Second, some clusters and states are pruned out implicitly when no data samples are assigned to them, thereby leading to automatic selection of the model complexity. Experiments on synthetic and real data demonstrate that our algorithm performs better than using model selection techniques with maximum likelihood estimation.

17.
Dev Psychol ; 59(2): 353-363, 2023 Feb.
Article in English | MEDLINE | ID: mdl-36342437

ABSTRACT

Early attention bias to threat-related negative emotions may lead children to overestimate dangers in social situations. This study examined the emergence of this bias in 168 Chinese toddlers and how it might develop in tandem with a known predictor of toddlers' fear of strangers, namely temperamental shyness. Measurable individual differences in attention bias to fearful faces were found and remained stable from age 12 to 18 months. When shown photos of paired happy versus fearful or happy versus angry faces, toddlers consistently gazed first more often at fearful faces than at happy faces and had longer initial and total fixations on them. However, they consistently gazed first more often at happy faces than at angry faces, and had a longer total fixation on angry faces only at 18 months. Stranger anxiety at 12 months predicted attention bias to fearful faces at 18 months. Temperamentally shyer 12-month-olds went on to show stronger attention bias to fearful faces at 18 months, and their fear of strangers also increased more from 12 to 18 months. Together with prior research suggesting that attention bias to angry or fearful faces foretells social anxiety, the present findings point to likely positive feedback loops among attention bias to fearful faces, temperamental shyness, and stranger anxiety in early childhood. (PsycInfo Database Record (c) 2023 APA, all rights reserved).


Subject(s)
Facial Expression , Fear , Humans , Child, Preschool , Infant , Fear/psychology , Anxiety , Anger , Happiness , Emotions
18.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 10519-10534, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37027650

ABSTRACT

Nested dropout is a variant of the dropout operation that can order network parameters or features according to a pre-defined importance during training. It has been explored for: I. Constructing nested nets (Cui et al., 2020; Cui et al., 2021): neural networks whose architectures can be adjusted instantly during testing time, e.g., based on computational constraints. Nested dropout implicitly ranks the network parameters, generating a set of sub-networks such that any smaller sub-network forms the basis of a larger one. II. Learning ordered representations (Rippel et al., 2014): nested dropout applied to the latent representation of a generative model (e.g., an auto-encoder) ranks the features, enforcing an explicit order of the dense representation over dimensions. However, the dropout rate is fixed as a hyper-parameter during the whole training process. For nested nets, when network parameters are removed, performance decays along a human-specified trajectory rather than a trajectory learned from data. For generative models, the importance of features is specified as a constant vector, restricting the flexibility of representation learning. To address these problems, we focus on the probabilistic counterpart of nested dropout. We propose a variational nested dropout (VND) operation that draws samples of multi-dimensional ordered masks at low cost, providing useful gradients to the parameters of nested dropout. Based on this approach, we design a Bayesian nested neural network that learns the order knowledge of the parameter distributions. We further exploit VND under different generative models for learning ordered latent distributions. In experiments, we show that the proposed approach outperforms the nested network in terms of accuracy, calibration, and out-of-domain detection in classification tasks. It also outperforms the related generative models on data generation tasks.
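
A sketch of the ordered-mask sampling that underlies nested dropout: draw a truncation index and keep only the units before it, so earlier units are trained more often and an importance ordering emerges. The geometric prior is the classic nested-dropout choice; VND replaces this fixed distribution with a learned one:

```python
import torch

def sample_ordered_mask(n_units, p=0.1, batch=4):
    """Keep a random prefix of units; drop the rest (nested dropout)."""
    keep = torch.distributions.Geometric(p).sample((batch,)).long() + 1
    keep = keep.clamp(max=n_units)                  # truncation index per sample
    idx = torch.arange(n_units).expand(batch, n_units)
    return (idx < keep.unsqueeze(1)).float()        # rows of 1s then 0s

features = torch.randn(4, 10)
mask = sample_ordered_mask(10)
print(mask)               # each row is a prefix of ones, e.g. [1,1,1,0,...,0]
print(features * mask)    # later dimensions are dropped first
```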


Subject(s)
Algorithms , Neural Networks, Computer , Humans , Bayes Theorem , Learning
19.
IEEE Trans Pattern Anal Mach Intell ; 44(2): 1035-1049, 2022 02.
Article in English | MEDLINE | ID: mdl-32749960

ABSTRACT

Diversity is one of the most important properties in image captioning, as it reflects the various expressions of important concepts presented in an image. However, the most popular metrics cannot adequately evaluate the diversity of multiple captions. In this paper, we first propose a metric to measure the diversity of a set of captions, which is derived from latent semantic analysis (LSA), and then kernelize LSA using CIDEr (R. Vedantam et al., 2015) similarity. Compared with mBLEU (R. Shetty et al., 2017), our proposed diversity metrics show a relatively strong correlation with human evaluation. We conduct extensive experiments and find a large gap between the performance of current state-of-the-art models and human annotations when considering both diversity and accuracy; models that aim to generate captions with higher CIDEr scores normally obtain lower diversity scores, as they generally learn to describe images using common words. To bridge this "diversity" gap, we consider several methods for training caption models to generate diverse captions. First, we show that balancing the cross-entropy loss and CIDEr reward in reinforcement learning during training can effectively control the tradeoff between diversity and accuracy of the generated captions. Second, we develop approaches that directly optimize our diversity metric and the CIDEr score using reinforcement learning. These approaches can be unified into a self-critical (S. J. Rennie et al., 2017) framework with new RL baselines. Third, we combine accuracy and diversity into a single measure using an ensemble matrix, and then maximize the determinant of the ensemble matrix via reinforcement learning to boost both, which outperforms its counterparts on the oracle test. Finally, inspired by determinantal point processes (DPP), we develop a DPP selection algorithm to select a subset of captions from a large number of candidates. The experimental results show that maximizing the determinant of the ensemble matrix outperforms the other methods, considerably improving both diversity and accuracy.
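
A sketch of the LSA-style diversity idea: build a pairwise similarity kernel over one image's caption set and score diversity from the flatness of its spectrum. The bag-of-words kernel stands in for the paper's CIDEr kernel, and the spectral score is one simple choice rather than the paper's exact formula:

```python
import numpy as np

def diversity(captions):
    vocab = sorted({w for c in captions for w in c.split()})
    X = np.array([[c.split().count(w) for w in vocab] for c in captions], float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-length caption vectors
    lam = np.clip(np.linalg.eigvalsh(X @ X.T), 0, None)  # kernel spectrum
    lam = lam / lam.sum()
    return -np.log(lam.max()) / np.log(len(captions))    # 0 = identical captions

same = ["a dog runs"] * 5
varied = ["a dog runs", "brown puppy on grass", "an animal plays outside",
          "dog chasing a ball", "pet sprinting in a park"]
print(diversity(same), diversity(varied))   # low vs. high diversity
```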


Subject(s)
Algorithms , Benchmarking , Humans , Learning , Semantics
20.
IEEE Trans Pattern Anal Mach Intell ; 44(3): 1357-1370, 2022 Mar.
Article in English | MEDLINE | ID: mdl-32903177

ABSTRACT

Crowd counting is an essential topic in computer vision due to its practical use in surveillance systems. The typical design of crowd counting algorithms is divided into two steps. First, the ground-truth density maps of crowd images are generated from the ground-truth dot maps (density map generation), e.g., by convolving with a Gaussian kernel. Second, deep learning models are designed to predict a density map from an input image (density map estimation). Density-map-based counting methods, which use the density map as an intermediate representation, have improved counting performance dramatically. However, in the sense of end-to-end training, the hand-crafted methods used for generating the density maps may not be optimal for the particular network or dataset used. To address this issue, we propose an adaptive density map generator, which takes the annotation dot map as input and learns a density map representation for a counter. The counter and generator are trained jointly within an end-to-end framework. We also show that the proposed framework can be applied to general dense object counting tasks. Extensive experiments are conducted on 10 datasets for 3 applications: crowd counting, vehicle counting, and general object counting. The experimental results on these datasets confirm the effectiveness of the proposed learnable density map representations.
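
A sketch of the conventional hand-crafted step that the adaptive generator replaces: producing a ground-truth density map by convolving the annotation dot map with a fixed Gaussian kernel. The coordinates and sigma below are hypothetical:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dot_to_density(points, shape, sigma=4.0):
    dots = np.zeros(shape, dtype=float)
    for x, y in points:                  # one impulse per annotated person
        dots[y, x] += 1.0
    return gaussian_filter(dots, sigma)  # density still integrates to the count

density = dot_to_density([(30, 40), (32, 44), (100, 60)], shape=(128, 128))
print(density.sum())                     # ~3.0, the ground-truth crowd count
```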
