Results 1 - 20 of 22
1.
Article in English | MEDLINE | ID: mdl-38356218

ABSTRACT

A key challenge for machine intelligence is to learn new visual concepts without forgetting previously acquired knowledge. Continual learning (CL) aims to address this challenge. However, a gap remains between CL and human learning: humans continually learn from samples associated with known or unknown labels in their daily lives, whereas existing CL and semi-supervised CL (SSCL) methods assume that all training samples come with known labels. Specifically, we are interested in two questions: 1) how to utilize unrelated unlabeled data for the SSCL task and 2) how unlabeled data affect learning and catastrophic forgetting in the CL task. To explore these issues, we formulate a new SSCL method that can be generically applied to existing CL models. Furthermore, we propose a novel gradient learner that learns from labeled data to predict gradients on unlabeled data. In this way, the unlabeled data fit into the supervised CL framework. We extensively evaluate the proposed method on mainstream CL methods, adversarial CL (ACL), and semi-supervised learning (SSL) tasks. The proposed method achieves state-of-the-art classification accuracy and backward transfer (BWT) in the CL setting while achieving the desired classification accuracy in the SSL setting. This implies that unlabeled images can enhance the generalizability of CL models to unseen data and significantly alleviate catastrophic forgetting. The code is available at https://github.com/luoyan407/grad_prediction.git. A minimal sketch of the gradient-learner idea follows.
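Below is a minimal, hypothetical PyTorch sketch of the gradient-learner idea: a small network is trained on labeled batches to regress the analytic cross-entropy gradient at the logits, and its predictions are then injected as surrogate gradients on unlabeled batches. All dimensions and module choices are illustrative assumptions; the authors' actual implementation is in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, n_classes = 128, 10
classifier = nn.Linear(feat_dim, n_classes)
# grad_learner predicts the gradient of the loss w.r.t. the logits from features alone.
grad_learner = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, n_classes))
opt = torch.optim.SGD(list(classifier.parameters()) + list(grad_learner.parameters()), lr=0.01)

def labeled_step(x, y):
    logits = classifier(x)
    # For cross-entropy, d(loss)/d(logits) = softmax(logits) - one_hot(y).
    true_grad = F.softmax(logits, dim=1) - F.one_hot(y, n_classes).float()
    loss = F.cross_entropy(logits, y) + F.mse_loss(grad_learner(x), true_grad.detach())
    opt.zero_grad(); loss.backward(); opt.step()

def unlabeled_step(x):
    logits = classifier(x)
    # Inject the predicted gradient at the logits so unlabeled data fit the
    # supervised update rule: backward of (logits * g).sum() places g on d/dlogits.
    g = grad_learner(x).detach()
    surrogate = (logits * g).sum()
    opt.zero_grad(); surrogate.backward(); opt.step()

labeled_step(torch.randn(32, feat_dim), torch.randint(0, n_classes, (32,)))
unlabeled_step(torch.randn(32, feat_dim))
```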

2.
Article in English | MEDLINE | ID: mdl-38289843

ABSTRACT

Conventional surface electromyography (sEMG)-based gesture recognition systems exhibit impressive performance in controlled laboratory settings. However, as most systems are trained in a closed-set setting, their performance may deteriorate significantly when novel gestures are presented as impostors. In addition, state-of-the-art generative and discriminative methods have achieved considerable performance only on high-density sEMG signals, an unrealistic setting, as real-world muscle-computer interfaces mainly rely on sparse multichannel sEMG signals. In this work, we propose a novel variational autoencoder-based approach for open-set gesture recognition from sparse multichannel sEMG signals. The conditional Gaussian distribution of each known gesture is learned against a predefined latent conditional distribution, and samples with low probability density are identified as unknown gestures. The sEMG signals of known gestures are classified using the Kullback-Leibler divergences between the predefined prior distributions and the input samples (sketched below). The proposed approach is evaluated on three benchmark sparse multichannel sEMG databases. The experimental results demonstrate that our approach outperforms existing open-set sEMG-based gesture recognition approaches and achieves a better trade-off between classifying known gestures and rejecting unknown ones.
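The following is a self-contained sketch of the KL-based decision rule described above, assuming diagonal-Gaussian posteriors and unit-variance class priors; the threshold, dimensions, and the encoder producing (mu, logvar) are placeholders, not the paper's exact design.

```python
import torch

def kl_diag_gaussians(mu_q, logvar_q, mu_p):
    # KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, I) ), closed form.
    var_q = logvar_q.exp()
    return 0.5 * (var_q + (mu_q - mu_p) ** 2 - 1.0 - logvar_q).sum(dim=-1)

def classify_open_set(mu_q, logvar_q, class_priors, reject_thresh):
    # class_priors: (C, D) predefined prior means, one per known gesture.
    kls = torch.stack([kl_diag_gaussians(mu_q, logvar_q, m) for m in class_priors], dim=1)
    best_kl, pred = kls.min(dim=1)
    pred[best_kl > reject_thresh] = -1   # -1 marks an unknown gesture
    return pred

mu_q, logvar_q = torch.randn(8, 16), torch.zeros(8, 16)   # from an assumed encoder
priors = torch.randn(5, 16) * 3.0                          # 5 known gestures
print(classify_open_set(mu_q, logvar_q, priors, reject_thresh=30.0))
```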


Subjects
Gestures , Recognition, Psychology , Humans , Electromyography/methods , Algorithms , Hand/physiology
3.
IEEE Trans Image Process ; 32: 3836-3846, 2023.
Article in English | MEDLINE | ID: mdl-37410654

ABSTRACT

Visual Commonsense Reasoning (VCR), regarded as a challenging extension of Visual Question Answering (VQA), pursues a higher level of visual comprehension. VCR includes two complementary processes: question answering over a given image and rationale inference that explains the answer. Over the years, a variety of VCR methods have pushed advancements on the benchmark dataset. Despite their significance, these methods often treat the two processes separately and hence decompose VCR into two independent VQA instances. As a result, the pivotal connection between question answering and rationale inference is broken, rendering existing efforts less faithful to visual reasoning. To study this issue empirically, we perform in-depth explorations in terms of both language shortcuts and generalization capability. Based on our findings, we then propose a plug-and-play knowledge distillation enhanced framework to couple the question answering and rationale inference processes. The key contribution lies in the introduction of a new branch, which serves as a relay to bridge the two processes. Given that our framework is model-agnostic, we apply it to the existing popular baselines and validate its effectiveness on the benchmark dataset. As the experimental results demonstrate, when equipped with our method, these baselines all achieve consistent and significant performance improvements, verifying the viability of coupling the two processes. A generic distillation-loss sketch follows.
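Since the paper's relay branch is model-specific, the sketch below shows only the generic knowledge-distillation objective that such a coupling could use: a cross-entropy term plus a temperature-scaled KL term pulling one branch toward the other. Names and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Standard KD objective: CE on labels plus temperature-scaled KL that
    # pulls the answering branch toward the bridging (relay) branch.
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1 - alpha) * kd

qa_logits = torch.randn(4, 4, requires_grad=True)   # answer scores (4 candidates)
bridge_logits = torch.randn(4, 4)                   # relay-branch scores
labels = torch.randint(0, 4, (4,))
distill_loss(qa_logits, bridge_logits.detach(), labels).backward()
```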

4.
Article in English | MEDLINE | ID: mdl-37126635

ABSTRACT

Unlearning the data observed during the training of a machine learning (ML) model is an important task that can play a pivotal role in fortifying the privacy and security of ML-based applications. This article raises the following questions: 1) can we unlearn one or multiple classes of data from an ML model without looking at the full training data even once? and 2) can we make the process of unlearning fast and scalable to large datasets, and generalize it to different deep networks? We introduce a novel machine unlearning framework with error-maximizing noise generation and impair-repair based weight manipulation that offers an efficient solution to these questions. An error-maximizing noise matrix is learned for the class to be unlearned using the original model, and this matrix is used to manipulate the model weights so that the targeted class is unlearned. We introduce impair and repair steps for controlled manipulation of the network weights: in the impair step, the noise matrix is used together with a very high learning rate to induce sharp unlearning in the model; the repair step is then used to regain overall performance. With very few update steps, we show excellent unlearning while substantially retaining overall model accuracy. Unlearning multiple classes requires a similar number of update steps as unlearning a single class, making our approach scalable to large problems. Our method is quite efficient in comparison to the existing methods, works for multiclass unlearning, does not put any constraints on the original optimization mechanism or network design, and works well in both small and large-scale vision tasks. This work is an important step toward fast and easy implementation of unlearning in deep networks. Source code: https://github.com/vikram2000b/Fast-Machine-Unlearning. A hedged sketch of the impair-repair recipe follows.
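The sketch below uses a stand-in linear model and synthetic retained data; the shapes, learning rates, and step counts are assumptions for illustration, and the reading of the impair step (fitting the noise to the target label) is one plausible interpretation of the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 8, 10))   # stand-in network
target_class = 3

# 1) Learn an error-maximizing noise matrix for the class to be unlearned,
#    keeping the original model frozen (only the noise is optimized).
noise = torch.randn(64, 8, 8, requires_grad=True)
noise_opt = torch.optim.Adam([noise], lr=0.1)
y = torch.full((64,), target_class, dtype=torch.long)
for _ in range(50):
    loss = -F.cross_entropy(model(noise), y)    # ascend the loss for the target label
    noise_opt.zero_grad(); loss.backward(); noise_opt.step()

# 2) Impair: a single high-learning-rate pass over the noise corrupts the
#    weights that encode the target class.
impair = torch.optim.SGD(model.parameters(), lr=0.1)
loss = F.cross_entropy(model(noise.detach()), y)
impair.zero_grad(); loss.backward(); impair.step()

# 3) Repair: a few low-learning-rate steps on retained data restore accuracy
#    on the remaining classes (synthetic stand-in data here).
repair = torch.optim.SGD(model.parameters(), lr=0.01)
for x_r, y_r in [(torch.randn(64, 8, 8), torch.randint(0, 10, (64,)))]:
    loss = F.cross_entropy(model(x_r), y_r)
    repair.zero_grad(); loss.backward(); repair.step()
```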

5.
IEEE Trans Pattern Anal Mach Intell ; 45(2): 2181-2192, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35320088

ABSTRACT

Due to the inherent irregularity and lack of ordering of point clouds, points emerge inconsistently across frames in a point cloud video. To capture the dynamics in point cloud videos, point tracking and limited temporal modeling ranges are usually employed to preserve spatio-temporal structure. However, as points may flow in and out across frames, computing accurate point trajectories is extremely difficult, especially for long videos. Moreover, when points move fast, they may escape from a region even within a small temporal window, and using the same temporal range for different motions may not accurately capture the temporal structure. In this paper, we propose a Point Spatio-Temporal Transformer (PST-Transformer). To preserve the spatio-temporal structure, PST-Transformer adaptively searches for related or similar points across the entire video by performing self-attention on point features. Moreover, PST-Transformer is equipped with the ability to encode spatio-temporal structure. Because point coordinates are irregular and unordered while point timestamps exhibit regularity and order, the spatio-temporal encoding is decoupled to reduce the impact of spatial irregularity on temporal modeling. By properly preserving and encoding spatio-temporal structure, PST-Transformer effectively models point cloud videos and shows superior performance on 3D action recognition and 4D semantic segmentation.

6.
IEEE Trans Pattern Anal Mach Intell ; 45(2): 1682-1699, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35446761

ABSTRACT

Attending selectively to emotion-eliciting stimuli is intrinsic to human vision. In this research, we investigate how the emotion-eliciting features of images relate to human selective attention. We create the EMOtional attention dataset (EMOd), a set of diverse emotion-eliciting images, each with (1) eye-tracking data from 16 subjects and (2) image context labels at both object and scene level. Based on analyses of human perceptions of EMOd, we report an emotion prioritization effect: emotion-eliciting content draws stronger and earlier human attention than neutral content, but this advantage diminishes dramatically after the initial fixation. We also find that human attention is more focused on awe-eliciting and aesthetic vehicle and animal scenes in EMOd. To model this human attention behavior computationally, we design a deep neural network (CASNet II) that includes a channel-weighting subnetwork, which prioritizes emotion-eliciting objects, and an Atrous Spatial Pyramid Pooling (ASPP) structure, which learns the relative importance of image regions at multiple scales (a standard ASPP sketch follows). Visualizations and quantitative analyses demonstrate the model's ability to simulate human attention behavior, especially on emotion-eliciting content.
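ASPP itself is a standard component; a minimal PyTorch version, with illustrative dilation rates and channel sizes, looks like this:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated convolutions over one
    input, letting the model weigh image regions at multiple scales."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # padding == dilation keeps the spatial resolution unchanged.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

feats = torch.randn(1, 256, 32, 32)
print(ASPP(256, 64)(feats).shape)   # torch.Size([1, 64, 32, 32])
```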


Subjects
Algorithms , Eye-Tracking Technology , Animals , Humans , Emotions , Attention , Computer Simulation
7.
IEEE Trans Pattern Anal Mach Intell ; 45(1): 525-538, 2023 Jan.
Article in English | MEDLINE | ID: mdl-35130150

ABSTRACT

Motivated by scenarios where data are used for diverse prediction tasks, we study whether fair representation can guarantee fairness for unknown tasks and for multiple fairness notions. We consider seven group fairness notions covering the concepts of independence, separation, and calibration. Against the backdrop of the fairness impossibility results, we explore approximate fairness. We prove that, although fair representation might not guarantee fairness for all prediction tasks, it does guarantee fairness for an important subset of tasks: those for which the representation is discriminative. Specifically, all seven group fairness notions are linearly controlled by the fairness and discriminativeness of the representation. When an incompatibility exists between different fairness notions, a fair and discriminative representation hits the sweet spot that approximately satisfies all notions. Motivated by our theoretical findings, we propose to learn both fair and discriminative representations using a pretext loss, which self-supervises learning, and Maximum Mean Discrepancy as a fair regularizer (sketched below). Experiments on tabular, image, and face datasets show that, using the learned representation, downstream predictions that were unknown when learning the representation indeed become fairer. The fairness guarantees computed from our theoretical results are all valid.
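A compact sketch of the MMD fair regularizer, computed between the representations of two groups with an RBF kernel; this is the simple biased estimator, and the bandwidth, dimensions, and weighting are placeholders.

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    # Squared Maximum Mean Discrepancy with an RBF kernel between the
    # representations of two demographic groups (biased estimator).
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

z_group0, z_group1 = torch.randn(64, 32), torch.randn(64, 32) + 0.5
fair_reg = rbf_mmd2(z_group0, z_group1)
# total_loss = pretext_loss + lam * fair_reg   (pretext loss self-supervises)
print(fair_reg.item())
```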

8.
IEEE Trans Cybern ; 52(8): 8114-8127, 2022 Aug.
Article in English | MEDLINE | ID: mdl-33531330

ABSTRACT

Monocular image-based 3-D model retrieval aims to search for relevant 3-D models in a dataset given one RGB image captured in the real world, which can significantly benefit applications such as self-service checkout and online shopping. To help advance this promising yet challenging research topic, we built a novel dataset and organized the first international contest on monocular image-based 3-D model retrieval. Moreover, we conduct a thorough analysis of the state-of-the-art methods, which can be classified into supervised and unsupervised ones. The supervised methods can be analyzed along several important aspects, such as the strategies for domain adaptation, view fusion, loss function, and similarity measure; the unsupervised methods focus on solving this problem with unlabeled data and domain adaptation. Seven popular metrics are employed to evaluate performance, and accordingly we provide a thorough analysis and guidance for future work. To the best of our knowledge, this is the first benchmark for monocular image-based 3-D model retrieval, and it aims to support related research in multiview feature learning, domain adaptation, and information retrieval.


Subjects
Algorithms , Benchmarking , Information Storage and Retrieval
9.
IEEE Trans Neural Netw Learn Syst ; 33(12): 7655-7666, 2022 Dec.
Article in English | MEDLINE | ID: mdl-34152991

ABSTRACT

Scene graph generation (SGGen) is a challenging task due to the complex visual context of an image. Intuitively, the human visual system can volitionally focus on attended regions driven by salient stimuli associated with visual cues. For example, to infer the relationship between a man and a horse, the interaction between the human leg and the horseback provides strong visual evidence for predicting the predicate ride; likewise, the attended region face can help determine the object man. To date, most existing works have studied SGGen by extracting coarse-grained bounding-box features, while understanding fine-grained visual regions has received limited attention. To mitigate this drawback, this article proposes a region-aware attention learning method. The key idea is to explicitly construct the attention space to explore salient regions for both object and predicate inference. First, we extract a set of regions in an image with the standard detection pipeline, where each region regresses to an object. Second, we propose an object-wise attention graph neural network (GNN), which incorporates attention modules into the graph structure to discover attended regions for object inference (a simplified region-attention sketch follows). Third, we build a predicate-wise co-attention GNN to jointly highlight the subject's and object's attended regions for predicate inference. In particular, each subject-object pair is connected with one of the latent predicates to construct one triplet, and the proposed intra-triplet and inter-triplet learning mechanism helps discover the pair-wise attended regions for predicate inference. Extensive experiments on two popular benchmarks demonstrate the superiority of the proposed method, and additional ablation studies and visualizations further validate its effectiveness.
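As a simplified illustration of object-wise region attention (not the paper's full GNN), one can score each candidate region against an object feature and pool the attended regions; module sizes and the scoring function are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttention(nn.Module):
    # Scores each candidate region against an object query and pools the
    # attended regions into one feature for object (or predicate) inference.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, query, regions):
        # query: (D,) object feature; regions: (N, D) region features.
        q = query.unsqueeze(0).expand(regions.size(0), -1)
        attn = F.softmax(self.score(torch.cat([q, regions], dim=1)).squeeze(1), dim=0)
        return (attn.unsqueeze(1) * regions).sum(dim=0), attn

obj_feat, region_feats = torch.randn(256), torch.randn(12, 256)
pooled, weights = RegionAttention(256)(obj_feat, region_feats)
print(pooled.shape, weights.shape)   # torch.Size([256]) torch.Size([12])
```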


Subjects
Attention , Neural Networks, Computer , Male , Humans , Horses , Animals , Learning
10.
IEEE Trans Pattern Anal Mach Intell ; 44(12): 9918-9930, 2022 Dec.
Article in English | MEDLINE | ID: mdl-34905491

ABSTRACT

In point cloud videos, point coordinates are irregular and unordered, but point timestamps exhibit regularity and order. Grid-based networks for conventional video processing cannot be directly used to model raw point cloud videos. Therefore, in this work, we propose a point-based network that directly handles raw point cloud videos. First, to preserve the spatio-temporal local structure of point cloud videos, we design a point tube covering a local range along the spatial and temporal dimensions. By progressively subsampling frames and points and enlarging the spatial radius as point features are fed into higher-level layers, the point tube captures video structure in a spatio-temporally hierarchical manner. Second, to reduce the impact of spatial irregularity on temporal modeling, we decompose space and time when extracting point tube representations: a spatial operation encodes the local structure of each spatial region in a tube, and a temporal operation encodes the dynamics of the spatial regions along the tube. Empirically, the proposed network shows strong performance on 3D action recognition, 4D semantic segmentation, and scene flow estimation. Theoretically, we analyze why it is necessary to decompose space and time in point cloud video modeling and why the network outperforms existing methods. A minimal sketch of this space-time decomposition follows.
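The sketch below assumes a pointwise MLP with max-pooling as the spatial operation and a 1-D convolution as the temporal operation; these are illustrative choices, not the paper's exact operators.

```python
import torch
import torch.nn as nn

class DecoupledTubeEncoder(nn.Module):
    # Space and time handled separately: a pointwise MLP + max-pool encodes
    # each frame's local region; a 1-D temporal conv then encodes dynamics.
    def __init__(self, in_dim=3, dim=64):
        super().__init__()
        self.spatial = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, tube):
        # tube: (T, N, 3) -- N points per frame inside one spatio-temporal tube.
        per_frame = self.spatial(tube).max(dim=1).values   # (T, dim), order-invariant
        return self.temporal(per_frame.t().unsqueeze(0)).squeeze(0).t()   # (T, dim)

tube = torch.randn(8, 32, 3)
print(DecoupledTubeEncoder()(tube).shape)   # torch.Size([8, 64])
```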

11.
IEEE Trans Image Process ; 30: 8332-8341, 2021.
Article in English | MEDLINE | ID: mdl-34587009

ABSTRACT

Raven's Progressive Matrices (RPM) are highly correlated with human intelligence and have been widely used to measure the abstract reasoning ability of humans. In this paper, to study the abstract reasoning capability of deep neural networks, we propose the first unsupervised learning method for solving RPM problems. Since ground-truth labels are not available, we design a pseudo target based on the prior constraints of the RPM formulation to approximate the ground-truth label, which effectively converts the unsupervised learning strategy into a supervised one. However, the pseudo target may mislabel the correct answer, and the resulting noisy contrast leads to inaccurate model training. To alleviate this issue, we propose to improve the model performance with negative answers. Moreover, we develop a decentralization method to adapt the feature representation to different RPM problems. Extensive experiments on three datasets demonstrate that our method even outperforms some supervised approaches. Our code is available at https://github.com/visiontao/ncd.


Subjects
Algorithms , Problem Solving , Humans , Intelligence , Intelligence Tests , Neural Networks, Computer
12.
IEEE Trans Pattern Anal Mach Intell ; 43(6): 1928-1946, 2021 Jun.
Article in English | MEDLINE | ID: mdl-31902755

ABSTRACT

One well-known challenge in computer vision tasks is the visual diversity of images, which can result in agreement or disagreement between the learned knowledge and the visual content of the current observation. In this work, we first define such an agreement in a concept learning process as congruency. Formally, given a particular task and a sufficiently large dataset, the congruency issue arises in the learning process when the task-specific semantics in the training data are highly varying. We propose a Direction Concentration Learning (DCL) method to improve congruency in the learning process, where enhancing congruency makes the convergence path less circuitous. The experimental results show that the proposed DCL method generalizes to state-of-the-art models and optimizers and improves performance on saliency prediction, continual learning, and classification tasks. Moreover, it helps mitigate catastrophic forgetting in the continual learning task. The code is publicly available at https://github.com/luoyan407/congruency. One illustrative way to quantify congruency is sketched below.
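Assuming congruency is read as directional agreement between the current update and an accumulated reference direction (an interpretation chosen for illustration, not the paper's exact definition), a tiny metric sketch:

```python
import torch
import torch.nn.functional as F

def congruency(update_dir, accumulated_dir):
    # Cosine agreement between the current update direction and the
    # accumulated (historical) descent direction; a less circuitous
    # convergence path keeps this value high.
    return F.cosine_similarity(update_dir.flatten(), accumulated_dir.flatten(), dim=0)

g_t = torch.randn(1000)                 # current gradient (flattened)
g_acc = g_t + 0.1 * torch.randn(1000)   # accumulated direction (stand-in)
print(congruency(g_t, g_acc).item())
```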

13.
IEEE Trans Neural Netw Learn Syst ; 31(2): 685-699, 2020 Feb.
Article in English | MEDLINE | ID: mdl-31094695

ABSTRACT

Intraclass compactness and interclass separability are crucial indicators of a model's ability to produce discriminative features: intraclass compactness indicates how close features with the same label are to each other, and interclass separability indicates how far apart features with different labels are. In this paper, we investigate the intraclass compactness and interclass separability of features learned by convolutional networks and propose a Gaussian-based softmax (G-softmax) function that can effectively improve both. The proposed function is simple to implement and can easily replace the softmax function. We evaluate the proposed G-softmax function on classification datasets (i.e., CIFAR-10, CIFAR-100, and Tiny ImageNet) and on multilabel classification datasets (i.e., MS COCO and NUS-WIDE). The experimental results show that the proposed G-softmax function improves the state-of-the-art models across all evaluated datasets. In addition, the analysis of intraclass compactness and interclass separability demonstrates the advantages of the proposed function over the softmax function, consistent with the performance improvement. More importantly, we observe that high intraclass compactness and interclass separability are linearly correlated with average precision on MS COCO and NUS-WIDE, which implies that improving intraclass compactness and interclass separability would improve average precision. A hedged sketch of a Gaussian-CDF softmax follows.
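The sketch assumes the form "per-class Gaussian CDF transform followed by softmax" with learnable means and variances; the paper's exact definition may differ, so treat this as a drop-in illustration only.

```python
import torch
import torch.nn as nn

class GSoftmax(nn.Module):
    # Sketch of a Gaussian-based softmax: each class activation is passed
    # through a learnable Gaussian CDF before the usual softmax (assumed form).
    def __init__(self, n_classes):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_classes))
        self.log_sigma = nn.Parameter(torch.zeros(n_classes))

    def forward(self, logits):
        sigma = self.log_sigma.exp()
        cdf = torch.distributions.Normal(self.mu, sigma).cdf(logits)
        return torch.softmax(cdf, dim=1)

probs = GSoftmax(10)(torch.randn(4, 10))
print(probs.sum(dim=1))   # each row sums to 1
```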

14.
IEEE Trans Image Process ; 29: 237-249, 2020.
Article in English | MEDLINE | ID: mdl-31369377

ABSTRACT

Unsupervised video object segmentation aims to automatically segment moving objects in an unconstrained video without any user annotation. So far, only a few unsupervised online methods have been reported in the literature, and their performance is still far from satisfactory because complementary information from future frames cannot be used in the online setting. To solve this challenging problem, we propose a novel unsupervised online video object segmentation (UOVOS) framework by construing the motion property of segmented regions as moving in concurrence with a generic object. By incorporating salient motion detection and object proposals, a pixel-wise fusion strategy is developed to effectively remove detection noise such as dynamic backgrounds and stationary objects (sketched below). Furthermore, by leveraging the segmentation obtained from immediately preceding frames, a forward propagation algorithm is employed to deal with unreliable motion detection and object proposals. Experimental results on several benchmark datasets demonstrate the efficacy of the proposed method. Compared to state-of-the-art unsupervised online segmentation algorithms, the proposed method achieves an absolute gain of 6.2%. Moreover, our method performs better than the best unsupervised offline algorithm on the DAVIS-2016 benchmark. Our code is available on the project website: https://www.github.com/visiontao/uovos.
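At its simplest, the pixel-wise fusion idea reduces to an intersection of thresholded maps; a toy NumPy sketch with placeholder thresholds (the paper's actual fusion is more elaborate):

```python
import numpy as np

def fuse_masks(motion_saliency, objectness, tau_m=0.5, tau_o=0.5):
    # Pixel-wise fusion: keep pixels that both move saliently and lie on a
    # generic object, suppressing dynamic background and stationary objects.
    return (motion_saliency > tau_m) & (objectness > tau_o)

h, w = 120, 160
motion = np.random.rand(h, w)    # e.g., from an optical-flow saliency map
objects = np.random.rand(h, w)   # e.g., from an object-proposal heat map
mask = fuse_masks(motion, objects)
print(mask.shape, mask.dtype)    # (120, 160) bool
```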

15.
PLoS One ; 14(9): e0221390, 2019.
Article in English | MEDLINE | ID: mdl-31513592

ABSTRACT

Sensor-based human activity recognition aims to detect various physical activities performed by people with ubiquitous sensors. Different from existing deep learning-based methods, which mainly extract black-box features from the raw sensor data, we propose a hierarchical multi-view aggregation network based on multi-view feature spaces. Specifically, we first construct various views of feature spaces for each individual sensor in terms of white-box and black-box features. Our model then learns a unified representation for multi-view features by aggregating views in a hierarchical context at the feature, position, and modality levels, with one aggregation module designed for each level. Based on the ideas of non-local operations and attention, our fusion method captures the correlations between features and leverages the relationships across different sensor positions and modalities (a minimal attention-pooling sketch follows). We comprehensively evaluate our method on 12 human activity benchmark datasets, and the resulting accuracy outperforms state-of-the-art approaches.
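A minimal sketch of attention-based view aggregation, applicable in principle at the feature, position, or modality level; the module sizes and the single-layer scorer are placeholders.

```python
import torch
import torch.nn as nn

class ViewAggregation(nn.Module):
    # Attention pooling over per-view embeddings; applied repeatedly, it can
    # merge feature-level, position-level, then modality-level views.
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, views):                             # views: (B, V, D)
        attn = torch.softmax(self.scorer(views), dim=1)   # (B, V, 1)
        return (attn * views).sum(dim=1)                  # (B, D)

views = torch.randn(16, 6, 128)   # 6 views per sample
print(ViewAggregation(128)(views).shape)   # torch.Size([16, 128])
```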


Subjects
Human Activities , Pattern Recognition, Automated/methods , Algorithms , Benchmarking , Humans , Neural Networks, Computer , Recognition, Psychology
16.
PLoS One ; 14(4): e0214444, 2019.
Article in English | MEDLINE | ID: mdl-30969988

ABSTRACT

Humans achieve visual object recognition rapidly and effortlessly. Object categorization is commonly believed to be achieved through interaction between bottom-up and top-down cognitive processing. In the ultra-rapid categorization scenario, where the stimuli appear briefly and response time is limited, it is assumed that a first sweep of feedforward information is sufficient to discriminate whether or not an object is present in a scene. However, whether and how feedback/top-down processing is involved within such a brief duration remains an open question. To this end, we examine how different top-down manipulations, such as category level, category type, and real-world size, interact in ultra-rapid categorization. We constructed a dataset comprising real-world scene images with a built-in measurement of target object display size and, based on this set of images, measured ultra-rapid object categorization performance in human subjects. Standard feedforward computational models representing scene features and a state-of-the-art object detection model were employed for auxiliary investigation. The results showed influences of 1) animacy (animal, vehicle, food), 2) level of abstraction (people, sport), and 3) real-world size (four target size levels) on ultra-rapid categorization processes. These findings support the involvement of top-down processing when rapidly categorizing certain objects, such as sport at a fine-grained level. Our human vs. model comparisons also shed light on possible collaboration and integration of the two, which may interest both experimental and computational vision researchers. All collected images and behavioral data, as well as code and models, are publicly available at https://osf.io/mqwjz/.


Subjects
Cognition , Computer Simulation , Pattern Recognition, Visual , Adolescent , Adult , Algorithms , Animals , Female , Food , Humans , Male , Motor Vehicles , Pattern Recognition, Automated , Photic Stimulation , Reaction Time , Vision, Ocular , Young Adult
17.
IEEE Trans Biomed Eng ; 66(10): 2964-2973, 2019 Oct.
Article in English | MEDLINE | ID: mdl-30762526

ABSTRACT

Gesture recognition using sparse multichannel surface electromyography (sEMG) is a challenging problem, and existing solutions are far from optimal from the perspective of muscle-computer interfaces. In this paper, we address this problem in the context of multi-view deep learning. A novel multi-view convolutional neural network (CNN) framework is proposed that combines classical sEMG feature sets with a CNN-based deep learning model. The framework consists of two parts. In the first part, multi-view representations of sEMG are modeled in parallel by a multistream CNN, and a performance-based view construction strategy is proposed to choose the most discriminative views from classical feature sets for sEMG-based gesture recognition. In the second part, the learned multi-view deep features are fused through a view aggregation network composed of early and late fusion subnetworks, taking advantage of both early and late fusion of the learned features (a minimal fusion sketch follows). Evaluations on 11 sparse multichannel sEMG databases, as well as five databases with both sEMG and inertial measurement unit data, demonstrate that our multi-view framework outperforms single-view methods on both unimodal and multimodal sEMG data streams.
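A compact sketch of a view aggregation network that combines early fusion (concatenated mid-level features) with late fusion (averaged per-view scores); the layer sizes and the additive combination are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    # Early fusion concatenates mid-level view features; late fusion averages
    # per-view class scores; both paths contribute to the final decision.
    def __init__(self, n_views, feat_dim, n_classes):
        super().__init__()
        self.streams = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU()) for _ in range(n_views))
        self.early_head = nn.Linear(64 * n_views, n_classes)
        self.late_heads = nn.ModuleList(nn.Linear(64, n_classes) for _ in range(n_views))

    def forward(self, views):   # views: (B, V, feat_dim)
        feats = [s(views[:, i]) for i, s in enumerate(self.streams)]
        early = self.early_head(torch.cat(feats, dim=1))
        late = torch.stack([h(f) for h, f in zip(self.late_heads, feats)]).mean(dim=0)
        return early + late

logits = MultiViewFusion(3, 32, 8)(torch.randn(4, 3, 32))
print(logits.shape)   # torch.Size([4, 8])
```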


Subjects
Deep Learning , Electromyography/methods , Gestures , User-Computer Interface , Datasets as Topic , Humans
18.
PLoS One ; 13(10): e0206049, 2018.
Article in English | MEDLINE | ID: mdl-30376567

ABSTRACT

Surface electromyography (sEMG)-based gesture recognition with deep learning plays an increasingly important role in human-computer interaction. Existing deep learning architectures are mainly based on the Convolutional Neural Network (CNN), which captures spatial information of the electromyogram signal. Motivated by the sequential nature of the electromyogram signal, we propose an attention-based hybrid CNN and RNN (CNN-RNN) architecture to better capture its temporal properties for gesture recognition (a minimal sketch follows). Moreover, we present a new sEMG image representation method based on a traditional feature vector, which enables deep learning architectures to extract implicit correlations between different channels of sparse multichannel electromyogram signals. Extensive experiments on five sEMG benchmark databases show that the proposed method outperforms all reported state-of-the-art methods on both sparse multichannel and high-density sEMG databases. For comparison with existing work, we set the window length to 200 ms for NinaProDB1 and NinaProDB2, and 150 ms for the BioPatRec and CapgMyo sub-databases and the csl-hdemg database. The recognition accuracies on these benchmarks are 87.0%, 82.2%, 94.1%, 99.7%, and 94.5%, which are 9.2%, 3.5%, 1.2%, 0.2%, and 5.2% higher than the state-of-the-art performance, respectively.
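A minimal sketch of an attention-based hybrid CNN-RNN for windowed multichannel signals, with assumed input shape (batch, time, channels, length) and placeholder layer sizes; the paper's architecture details differ.

```python
import torch
import torch.nn as nn

class CNNRNN(nn.Module):
    # A per-frame CNN extracts spatial features from each sEMG window; an LSTM
    # models the temporal sequence; attention pools the hidden states.
    def __init__(self, channels, n_classes, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.rnn = nn.LSTM(64, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):   # x: (B, T, C, L) windows of sEMG
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).squeeze(-1).view(b, t, -1)   # (B, T, 64)
        h, _ = self.rnn(f)                                         # (B, T, H)
        w = torch.softmax(self.attn(h), dim=1)                     # (B, T, 1)
        return self.head((w * h).sum(dim=1))

out = CNNRNN(channels=10, n_classes=8)(torch.randn(2, 20, 10, 50))
print(out.shape)   # torch.Size([2, 8])
```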


Subjects
Algorithms , Attention/physiology , Electromyography , Gestures , Neural Networks, Computer , Pattern Recognition, Automated , Databases as Topic , Humans , Image Processing, Computer-Assisted , Signal Processing, Computer-Assisted , Time Factors
19.
IEEE Trans Pattern Anal Mach Intell ; 39(1): 102-114, 2017 Jan.
Article in English | MEDLINE | ID: mdl-26955018

ABSTRACT

This paper proposes a hierarchical clustering multi-task learning (HC-MTL) method for joint human action grouping and recognition. Specifically, we formulate the objective function as a group-wise least-squares loss regularized by low rank and sparsity, with respect to two latent variables, the model parameters and the grouping information, for joint optimization. To handle this non-convex optimization, we decompose it into two sub-tasks: multi-task learning and task relatedness discovery. First, we convert the non-convex objective function into a convex formulation by fixing the latent grouping information; this new objective focuses on multi-task learning, strengthening the shared-action relationship and action-specific feature learning. Second, we leverage the learned model parameters for the task relatedness measure and clustering. By alternating between the two steps, HC-MTL attains both optimal action models and group discovery (see the sketch after this paragraph). The proposed method is validated on three kinds of challenging datasets: six realistic action datasets (Hollywood2, YouTube, UCF Sports, UCF50, HMDB51, and UCF101), two constrained datasets (KTH and TJU), and two multi-view datasets (MV-TJU and IXMAS). The extensive experimental results show that: 1) HC-MTL produces performance competitive with the state of the art for action recognition and grouping; 2) HC-MTL overcomes the difficulty of heuristic action grouping based purely on human knowledge; and 3) HC-MTL avoids the possible inconsistency between subjective action grouping based on human knowledge and objective action grouping based on the feature subspace distributions of multiple actions. Comparison with the popular clustered multi-task learning further reveals that the latent relatedness discovered by HC-MTL aids group-wise multi-task learning and boosts performance. To the best of our knowledge, ours is the first work that breaks the assumption that all actions are either independent, for individual learning, or correlated, for joint modeling, and proposes HC-MTL for automated, joint action grouping and modeling.
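A toy sketch of the alternating scheme, substituting ridge regression for the group-wise regularized least-squares step and k-means for relatedness discovery; both are stand-ins for illustration, not the paper's exact solvers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

def hc_mtl(task_data, n_groups, n_iters=5):
    # Alternate between (1) fitting per-task linear models -- a stand-in for
    # the group-wise regularized least-squares step -- and (2) re-clustering
    # tasks by the similarity of their learned parameters.
    groups = np.zeros(len(task_data), dtype=int)
    for _ in range(n_iters):
        params = []
        for X, y in task_data:                 # step 1: multi-task learning
            params.append(Ridge(alpha=1.0).fit(X, y).coef_)
        W = np.stack(params)
        groups = KMeans(n_clusters=n_groups, n_init=10).fit_predict(W)  # step 2
        # (a full implementation would re-fit with a shared penalty per group)
    return groups

tasks = [(np.random.randn(40, 8), np.random.randn(40)) for _ in range(6)]
print(hc_mtl(tasks, n_groups=2))
```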


Subjects
Artificial Intelligence , Learning , Algorithms , Cluster Analysis , Databases, Factual , Humans , Least-Squares Analysis , Machine Learning , Pattern Recognition, Automated , Task Performance and Analysis
20.
IEEE Trans Cybern ; 47(7): 1781-1794, 2017 Jul.
Article in English | MEDLINE | ID: mdl-27429453

ABSTRACT

Human action recognition is an active research area in both the computer vision and machine learning communities. Over the past decades, the machine learning problem has evolved from the conventional single-view learning problem to cross-view learning, cross-domain learning, and multitask learning, for which a large number of algorithms have been proposed in the literature. Despite the large number of action recognition datasets, most are designed for a subset of these four learning problems, and comparisons between algorithms can be further limited by variances within datasets, experimental configurations, and other factors. To the best of our knowledge, no existing dataset allows concurrent analysis of all four learning problems. In this paper, we introduce a novel multimodal, multiview, and interactive (M2I) dataset designed for evaluating human action recognition methods under all four scenarios. The dataset consists of 1760 action samples from 22 action categories, including nine person-person interactive actions and 13 person-object interactive actions. We systematically benchmark state-of-the-art approaches on the M2I dataset for all four learning problems, evaluating 13 approaches with nine popular feature and descriptor combinations. Our comprehensive analysis demonstrates that the M2I dataset is challenging due to significant intraclass and view variations and multiple similar action categories, and that it provides a solid foundation for evaluating existing state-of-the-art algorithms.
