Results 1 - 20 of 23
1.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 3537-3556, 2024 May.
Article in English | MEDLINE | ID: mdl-38145536

ABSTRACT

3D object detection from images, one of the fundamental and challenging problems in autonomous driving, has received increasing attention from both industry and academia in recent years. Benefiting from the rapid development of deep learning technologies, image-based 3D detection has achieved remarkable progress. In particular, more than 200 works have studied this problem from 2015 to 2021, encompassing a broad spectrum of theories, algorithms, and applications. However, to date no recent survey exists to collect and organize this knowledge. In this paper, we fill this gap in the literature and provide the first comprehensive survey of this novel and continuously growing research field, summarizing the most commonly used pipelines for image-based 3D detection and deeply analyzing each of their components. Additionally, we propose two new taxonomies to organize the state-of-the-art methods into different categories, with the intent of providing a more systematic review of existing methods and facilitating fair comparisons with future works. Looking back at what has been achieved so far, we also analyze the current challenges in the field and discuss future directions for image-based 3D detection research.

2.
IEEE Trans Pattern Anal Mach Intell ; 45(12): 14234-14247, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37647185

ABSTRACT

Deep-learning models for 3D point cloud semantic segmentation exhibit limited generalization capabilities when trained and tested on data captured with different sensors or in varying environments due to domain shift. Domain adaptation methods can be employed to mitigate this domain shift, for instance, by simulating sensor noise, developing domain-agnostic generators, or training point cloud completion networks. Often, these methods are tailored for range view maps or necessitate multi-modal input. In contrast, domain adaptation in the image domain can be executed through sample mixing, which emphasizes input data manipulation rather than employing distinct adaptation modules. In this study, we introduce compositional semantic mixing for point cloud domain adaptation, representing the first unsupervised domain adaptation technique for point cloud segmentation based on semantic and geometric sample mixing. We present a two-branch symmetric network architecture capable of concurrently processing point clouds from a source domain (e.g. synthetic) and point clouds from a target domain (e.g. real-world). Each branch operates within one domain by integrating selected data fragments from the other domain and utilizing semantic information derived from source labels and target (pseudo) labels. Additionally, our method can leverage a limited number of human point-level annotations (semi-supervised) to further enhance performance. We assess our approach in both synthetic-to-real and real-to-real scenarios using LiDAR datasets and demonstrate that it significantly outperforms state-of-the-art methods in both unsupervised and semi-supervised settings.
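The sample-mixing idea above can be illustrated with a deliberately simplified sketch (this is not the paper's implementation — the function name, the selection policy, and the label handling are illustrative assumptions): points of selected semantic classes are lifted out of one domain's cloud and composed into the other's, with labels (or pseudo-labels) travelling alongside the points.

```python
import numpy as np

def semantic_mix(src_pts, src_lbl, tgt_pts, tgt_lbl, classes_from_src):
    """Compose points of the chosen classes from a source cloud into a
    target cloud; labels travel with the points.

    src_pts/tgt_pts: (N, 3) xyz coordinates; src_lbl/tgt_lbl: (N,) class ids.
    """
    keep = np.isin(src_lbl, classes_from_src)        # source points to move
    mixed_pts = np.vstack([tgt_pts, src_pts[keep]])  # target cloud + fragments
    mixed_lbl = np.concatenate([tgt_lbl, src_lbl[keep]])
    return mixed_pts, mixed_lbl
```

A symmetric call with source and target swapped would give the second branch of the two-branch architecture described above.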

3.
IEEE Trans Pattern Anal Mach Intell ; 45(2): 2567-2581, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35358042

ABSTRACT

A fundamental and challenging problem in deep learning is catastrophic forgetting, i.e., the tendency of neural networks to fail to preserve the knowledge acquired from old tasks when learning new tasks. This problem has been widely investigated in the research community and several Incremental Learning (IL) approaches have been proposed in the past years. While earlier works in computer vision have mostly focused on image classification and object detection, more recently some IL approaches for semantic segmentation have been introduced. These previous works showed that, despite its simplicity, knowledge distillation can be effectively employed to alleviate catastrophic forgetting. In this paper, we follow this research direction and, inspired by recent literature on contrastive learning, we propose a novel distillation framework, Uncertainty-aware Contrastive Distillation (UCD). In a nutshell, UCD operates by introducing a novel distillation loss that takes into account all the images in a mini-batch, enforcing similarity between features associated with all the pixels from the same classes, and pulling apart those corresponding to pixels from different classes. In order to mitigate catastrophic forgetting, we contrast features of the new model with features extracted by a frozen model learned at the previous incremental step. Our experimental results demonstrate the advantage of the proposed distillation technique, which can be used in synergy with previous IL approaches, and leads to state-of-the-art performance on three commonly adopted benchmarks for incremental semantic segmentation.
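As a rough illustration of the distillation idea (minus the uncertainty weighting), the sketch below contrasts pixel features of the new model against those of the frozen old model: same-class pairs act as positives, everything else as negatives, in an InfoNCE-style loss. The function, the temperature value, and the batch layout are illustrative assumptions, not the paper's code.

```python
import numpy as np

def contrastive_distillation_loss(feat_new, feat_old, labels, tau=0.1):
    """Toy contrastive distillation between two feature sets.

    feat_new: (N, D) pixel features from the current model.
    feat_old: (N, D) pixel features from the frozen previous-step model.
    labels:   (N,) class index per pixel (ground truth or pseudo-label).
    """
    # L2-normalise so dot products are cosine similarities
    fn = feat_new / np.linalg.norm(feat_new, axis=1, keepdims=True)
    fo = feat_old / np.linalg.norm(feat_old, axis=1, keepdims=True)
    sim = fn @ fo.T / tau                       # (N, N) similarity matrix
    pos = labels[:, None] == labels[None, :]    # positive-pair mask
    # row-wise log-softmax; loss = negative mean log-prob of positive pairs
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -(logp[pos]).mean()
```

The loss drops when same-class features of the two models align and rises when labels and feature geometry disagree, which is the behaviour the distillation term is meant to enforce.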

4.
Animals (Basel) ; 12(20)2022 Oct 19.
Article in English | MEDLINE | ID: mdl-36290219

ABSTRACT

The human-animal relationship is ancient, complex, and multifaceted. It may have positive effects on both humans and animals, or poor, even negative and detrimental, effects on animals or on both humans and animals. A large body of literature has investigated the beneficial effects of this relationship, in which both humans and animals appear to gain physical and psychological benefits from living together in a reciprocated interaction. However, analyzing the literature from a different perspective, it clearly emerges that human-animal relationships are not rarely characterized by different forms and levels of discomfort and suffering for animals and, in some cases, also for people. The negative physical and psychological consequences for animals' well-being may be very nuanced and concealed, but there are situations in which they are clear and striking, as in the case of animal violence, abuse, or neglect. Empathy, attachment, and anthropomorphism are human psychological mechanisms considered relevant for positive and healthy relationships with animals, but when dysfunctional or pathological they determine physical or psychological suffering, or both, in animals, as occurs in animal hoarding. The current work reviews some of the literature on the multifaceted nature of the human-animal relationship; describes the key role of empathy, attachment, and anthropomorphism in human-animal relationships; and seeks to depict how these psychological processes become distorted and dysfunctional in animal hoarding, with highly detrimental effects on both animal and human well-being.

5.
Article in English | MEDLINE | ID: mdl-36215371

ABSTRACT

In structure from motion, the viewing graph is a graph whose vertices correspond to cameras (or images) and whose edges represent fundamental matrices. We provide a new formulation and an algorithm for determining whether a viewing graph is solvable, i.e., uniquely determines a set of projective cameras. The known theoretical conditions either do not fully characterize the solvability of all viewing graphs, or are extremely difficult to compute because they involve solving a system of polynomial equations with a large number of unknowns. The main result of this paper is a method to reduce the number of unknowns by exploiting cycle consistency. We advance the understanding of solvability by (i) completing the classification of all minimal graphs up to 9 nodes, (ii) extending the practical verification of solvability to minimal graphs with up to 90 nodes, (iii) answering an open research question by showing that finite solvability is not equivalent to solvability, and (iv) formally drawing the connection with the calibrated case (i.e., parallel rigidity). Finally, we present an experiment on real data showing that unsolvable graphs may appear in practice.

6.
Article in English | MEDLINE | ID: mdl-35235506

ABSTRACT

Convolutional neural networks have enabled major progress in pixel-level prediction tasks such as semantic segmentation, depth estimation, and surface normal prediction, benefiting from their powerful capabilities in visual representation learning. Typically, state-of-the-art models integrate attention mechanisms for improved deep feature representations. Recently, some works have demonstrated the significance of learning and combining both spatial- and channel-wise attention for deep feature refinement. In this paper, we aim at effectively boosting previous approaches and propose a unified deep framework to jointly learn both spatial attention maps and channel attention vectors in a principled manner, so as to structure the resulting attention tensors and model interactions between these two types of attention. Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework, leading to VarIational STructured Attention networks (VISTA-Net). We implement the inference rules within the neural network, thus allowing for end-to-end learning of the probabilistic and the CNN front-end parameters. As demonstrated by our extensive empirical evaluation on six large-scale datasets for dense visual prediction, VISTA-Net outperforms the state of the art in multiple continuous and discrete prediction tasks, confirming the benefit of the proposed approach to joint structured spatial-channel attention estimation for deep representation learning. The code is available at https://github.com/ygjwd12345/VISTA-Net.
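The underlying intuition — a channel attention vector and a spatial attention map combined into one structured attention tensor — can be sketched very crudely as follows. The real VISTA-Net learns both attentions probabilistically and end-to-end; this toy version simply derives them from average activations, so every design choice here is an illustrative assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_channel_attention(feat):
    """Combine a channel attention vector with a spatial attention map.

    feat: (C, H, W) feature tensor. Returns a refined (C, H, W) tensor.
    """
    C, H, W = feat.shape
    # channel attention: softmax over per-channel average activations
    chan = softmax(feat.mean(axis=(1, 2)))                    # (C,)
    # spatial attention: softmax over per-location average activations
    spat = softmax(feat.mean(axis=0).ravel()).reshape(H, W)   # (H, W)
    # structured interaction: outer product of the two attentions
    attn = chan[:, None, None] * spat[None, :, :]             # (C, H, W)
    return feat * attn * (C * H * W)  # rescale so the mean attention is 1
```

The outer product is the simplest way to couple the two attention types; the paper's point is precisely that richer, learned interactions between them outperform such fixed combinations.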

7.
IEEE Trans Pattern Anal Mach Intell ; 44(12): 10099-10113, 2022 Dec.
Article in English | MEDLINE | ID: mdl-34882548

ABSTRACT

Deep neural networks have enabled major progress in semantic segmentation. However, even the most advanced neural architectures suffer from important limitations. First, they are vulnerable to catastrophic forgetting, i.e., they perform poorly when they are required to incrementally update their model as new classes become available. Second, they rely on large amounts of pixel-level annotations to produce accurate segmentation maps. To tackle these issues, we introduce a novel incremental class learning approach for semantic segmentation taking into account a peculiar aspect of this task: since each training step provides annotation only for a subset of all possible classes, pixels of the background class exhibit a semantic shift. Therefore, we revisit the traditional distillation paradigm by designing novel loss terms which explicitly account for the background shift. Additionally, we introduce a novel strategy to initialize the classifier's parameters at each step in order to prevent biased predictions toward the background class. Finally, we demonstrate that our approach can be extended to point- and scribble-based weakly supervised segmentation, modeling the partial annotations to create priors for unlabeled pixels. We demonstrate the effectiveness of our approach with an extensive evaluation on the Pascal-VOC, ADE20K, and Cityscapes datasets, significantly outperforming state-of-the-art methods.

8.
IEEE Trans Pattern Anal Mach Intell ; 44(5): 2673-2688, 2022 May.
Article in English | MEDLINE | ID: mdl-33301402

ABSTRACT

Multi-scale representations deeply learned via convolutional neural networks have shown tremendous importance for various pixel-level prediction problems. In this paper we present a novel approach that advances the state of the art on pixel-level prediction in a fundamental aspect, i.e. structured multi-scale features learning and fusion. In contrast to previous works directly considering multi-scale feature maps obtained from the inner layers of a primary CNN architecture, and simply fusing the features with weighted averaging or concatenation, we propose a probabilistic graph attention network structure based on a novel Attention-Gated Conditional Random Fields (AG-CRFs) model for learning and fusing multi-scale representations in a principled manner. In order to further improve the learning capacity of the network structure, we propose to exploit feature-dependent conditional kernels within the deep probabilistic framework. Extensive experiments are conducted on four publicly available datasets (i.e. BSDS500, NYUD-V2, KITTI and Pascal-Context) and on three challenging pixel-wise prediction problems involving both discrete and continuous labels (i.e. monocular depth estimation, object contour prediction and semantic segmentation). Quantitative and qualitative results demonstrate the effectiveness of the proposed latent AG-CRF model and the overall probabilistic graph attention network with feature conditional kernels for structured feature learning and pixel-wise prediction.

9.
IEEE Trans Pattern Anal Mach Intell ; 43(2): 485-498, 2021 02.
Article in English | MEDLINE | ID: mdl-31398109

ABSTRACT

Unsupervised Domain Adaptation (UDA) refers to the problem of learning a model in a target domain where labeled data are not available by leveraging information from annotated data in a source domain. Most deep UDA approaches operate in a single-source, single-target scenario, i.e., they assume that the source and the target samples arise from a single distribution. However, in practice most datasets can be regarded as mixtures of multiple domains. In these cases, exploiting traditional single-source, single-target methods for learning classification models may lead to poor results. Furthermore, it is often difficult to provide the domain labels for all data points, i.e. latent domains should be automatically discovered. This paper introduces a novel deep architecture which addresses the problem of UDA by automatically discovering latent domains in visual datasets and exploiting this information to learn robust target classifiers. Specifically, our architecture is based on two main components, i.e. a side branch that automatically computes the assignment of each sample to its latent domain and novel layers that exploit domain membership information to appropriately align the distribution of the CNN internal feature representations to a reference distribution. We evaluate our approach on publicly available benchmarks, showing that it outperforms state-of-the-art domain adaptation methods.
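The domain-alignment idea can be sketched as a normalization layer driven by soft latent-domain assignments: each sample is whitened with the (soft) statistics of its own discovered domain, so every latent domain is mapped to a common zero-mean, unit-variance reference. This is a simplified stand-in for illustration, not the paper's actual layers; the function name and the hard reference distribution are assumptions.

```python
import numpy as np

def latent_domain_align(x, domain_prob, eps=1e-5):
    """Normalise each sample with the statistics of its (soft) latent domain.

    x:           (N, D) batch of features.
    domain_prob: (N, K) soft assignment of each sample to K latent domains
                 (e.g. produced by a domain-prediction side branch).
    """
    out = np.zeros_like(x, dtype=float)
    for k in range(domain_prob.shape[1]):
        w = domain_prob[:, k:k + 1]                # (N, 1) soft weights
        wsum = w.sum()
        mu = (w * x).sum(axis=0) / wsum            # weighted domain mean
        var = (w * (x - mu) ** 2).sum(axis=0) / wsum
        out += w * (x - mu) / np.sqrt(var + eps)   # per-domain whitening
    return out
```

With hard (one-hot) assignments this degenerates to per-domain batch normalization; the soft weights are what let the domain discovery remain differentiable.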

10.
IEEE Trans Pattern Anal Mach Intell ; 43(12): 4441-4452, 2021 12.
Article in English | MEDLINE | ID: mdl-32750781

ABSTRACT

One of the main challenges for developing visual recognition systems working in the wild is to devise computational models immune from the domain shift problem, i.e., accurate when test data are drawn from a (slightly) different data distribution than training samples. In the last decade, several research efforts have been devoted to devising algorithmic solutions for this issue. Recent attempts to mitigate domain shift have resulted in deep learning models for domain adaptation which learn domain-invariant representations by introducing appropriate loss terms, by casting the problem within an adversarial learning framework, or by embedding specific domain normalization layers into the deep network. This paper describes a novel approach for unsupervised domain adaptation. Similarly to previous works, we propose to align the learned representations by embedding them into appropriate network feature normalization layers. In contrast to previous works, our Domain Alignment Layers are designed not only to match the source and target feature distributions but also to automatically learn the degree of feature alignment required at different levels of the deep network. Unlike most previous deep domain adaptation methods, our approach is able to operate in a multi-source setting. Thorough experiments on four publicly available benchmarks confirm the effectiveness of our approach.

11.
IEEE Trans Med Imaging ; 39(8): 2676-2687, 2020 08.
Article in English | MEDLINE | ID: mdl-32406829

ABSTRACT

Deep learning (DL) has proved successful in medical imaging and, in the wake of the recent COVID-19 pandemic, some works have started to investigate DL-based solutions for the assisted diagnosis of lung diseases. While existing works focus on CT scans, this paper studies the application of DL techniques for the analysis of lung ultrasonography (LUS) images. Specifically, we present a novel fully-annotated dataset of LUS images collected from several Italian hospitals, with labels indicating the degree of disease severity at frame level, video level, and pixel level (segmentation masks). Leveraging these data, we introduce several deep models that address relevant tasks for the automatic analysis of LUS images. In particular, we present a novel deep network, derived from Spatial Transformer Networks, which simultaneously predicts the disease severity score associated with an input frame and provides localization of pathological artefacts in a weakly-supervised way. Furthermore, we introduce a new method based on uninorms for effective frame score aggregation at the video level. Finally, we benchmark state-of-the-art deep models for estimating pixel-level segmentations of COVID-19 imaging biomarkers. Experiments on the proposed dataset demonstrate satisfactory results on all the considered tasks, paving the way to future research on DL for the assisted diagnosis of COVID-19 from LUS data.


Subjects
Coronavirus Infections/diagnostic imaging , Deep Learning , Image Interpretation, Computer-Assisted/methods , Pneumonia, Viral/diagnostic imaging , Ultrasonography/methods , Betacoronavirus , COVID-19 , Humans , Lung/diagnostic imaging , Pandemics , Point-of-Care Systems , SARS-CoV-2
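The uninorm-based aggregation mentioned above can be illustrated with one representative uninorm (neutral element 0.5): frame scores above the neutral element reinforce each other toward a high video-level score, scores below it pull the aggregate down, and 0.5 leaves it unchanged. The specific uninorm and the neutral element are assumptions for illustration; the paper's exact operator may differ.

```python
from functools import reduce

def uninorm(x, y):
    """'Product' uninorm with neutral element 0.5 on [0, 1].

    U(x, 0.5) = x; values above 0.5 reinforce, values below drag down.
    The conflicting corner (0 with 1) is resolved to 0 by convention.
    """
    num = x * y
    den = num + (1 - x) * (1 - y)
    return num / den if den > 0 else 0.0

def aggregate_frame_scores(scores):
    """Fold per-frame severity scores in [0, 1] into one video-level score."""
    return reduce(uninorm, scores, 0.5)  # start from the neutral element
```

Unlike a plain average, repeated moderately-high frame scores push the video-level score well above any single frame's value, which matches the intuition that consistent frame-level evidence should compound.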
12.
IEEE Trans Pattern Anal Mach Intell ; 42(10): 2380-2395, 2020 Oct.
Article in English | MEDLINE | ID: mdl-31545713

ABSTRACT

Recent deep monocular depth estimation approaches based on supervised regression have achieved remarkable performance. However, they require costly ground truth annotations during training. To cope with this issue, in this paper we present a novel unsupervised deep learning approach for predicting depth maps. We introduce a new network architecture, named Progressive Fusion Network (PFN), that is specifically designed for binocular stereo depth estimation. This network is based on a multi-scale refinement strategy that combines the information provided by both stereo views. In addition, we propose to stack this network twice in order to form a cycle. This cycle approach can be interpreted as a form of data augmentation since, at training time, the network learns both from the training-set images (in the forward half-cycle) and from the synthesized images (in the backward half-cycle). The architecture is jointly trained with adversarial learning. Extensive experiments on the publicly available datasets KITTI, Cityscapes and ApolloScape demonstrate the effectiveness of the proposed model, which is competitive with other unsupervised deep learning methods for depth prediction.

13.
IEEE Trans Pattern Anal Mach Intell ; 41(6): 1426-1440, 2019 Jun.
Article in English | MEDLINE | ID: mdl-29994300

ABSTRACT

Depth cues have proved very useful in various computer vision and robotic tasks. This paper addresses the problem of monocular depth estimation from a single still image. Inspired by the effectiveness of recent works on multi-scale convolutional neural networks (CNN), we propose a deep model which fuses complementary information derived from multiple CNN side outputs. Different from previous methods using concatenation or weighted average schemes, the integration is obtained by means of continuous Conditional Random Fields (CRFs). In particular, we propose two different variations, one based on a cascade of multiple CRFs, the other on a unified graphical model. By designing a novel CNN implementation of mean-field updates for continuous CRFs, we show that both proposed models can be regarded as sequential deep networks and that training can be performed end-to-end. Through an extensive experimental evaluation, we demonstrate the effectiveness of the proposed approach and establish new state-of-the-art results for the monocular depth estimation task on three publicly available datasets, i.e., NYUD-V2, Make3D and KITTI.

14.
IEEE Trans Image Process ; 27(9): 4410-4421, 2018 Sep.
Article in English | MEDLINE | ID: mdl-29870357

ABSTRACT

In this paper, we address the problem of learning robust cross-domain representations for sketch-based image retrieval (SBIR). While most SBIR approaches focus on extracting low- and mid-level descriptors for direct feature matching, recent works have shown the benefit of learning coupled feature representations to describe data from two related sources. However, cross-domain representation learning methods are typically cast into non-convex minimization problems that are difficult to optimize, leading to unsatisfactory performance. Inspired by self-paced learning (SPL), a learning methodology designed to overcome convergence issues related to local optima by exploiting the samples in a meaningful order (i.e., easy to hard), we introduce the cross-paced partial curriculum learning (CPPCL) framework. Compared with existing SPL methods, which only consider a single modality and cannot deal with prior knowledge, CPPCL is specifically designed to assess the learning pace by jointly handling data from dual sources and modality-specific prior information provided in the form of partial curricula. In addition, thanks to the learned dictionaries, we demonstrate that the proposed CPPCL embeds robust coupled representations for SBIR. Our approach is extensively evaluated on four publicly available datasets (i.e., CUFS, Flickr15K, QueenMary SBIR, and TU-Berlin Extension datasets), showing superior performance over competing SBIR methods.
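Plain self-paced learning — the starting point that CPPCL generalizes — can be sketched in a few lines: samples are admitted easy-to-hard by comparing their current loss to a pace parameter that grows between training rounds. The binary weighting and the geometric schedule below are the textbook SPL form, not the paper's cross-paced variant; all names and constants are illustrative.

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Binary SPL weights: keep a sample only if its loss is below lam,
    i.e. only the currently 'easy' samples enter the next update."""
    return (np.asarray(losses) < lam).astype(float)

def self_paced_schedule(losses, lam0=0.5, growth=2.0, steps=3):
    """Grow lam each round so progressively harder samples are admitted;
    returns how many samples are active at each round."""
    lam = lam0
    admitted = []
    for _ in range(steps):
        admitted.append(self_paced_weights(losses, lam).sum())
        lam *= growth
    return admitted
```

CPPCL's departure from this baseline, per the abstract, is to pace two coupled modalities jointly and to bias the admission order with partial curricula supplied as prior knowledge.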

15.
Med Biol Eng Comput ; 55(6): 909-921, 2017 Jun.
Article in English | MEDLINE | ID: mdl-27638109

ABSTRACT

Stroke patients should be dispatched to the highest level of care available in the shortest time. In this context, a transportable system in specialized ambulances, able to evaluate the presence of an acute brain lesion in a short time interval (i.e., a few minutes), could shorten delay of treatment. UWB radar imaging is an emerging diagnostic branch that has great potential for the implementation of a transportable and low-cost device. Transportability, low cost and short response time pose challenges to the signal processing algorithms of the backscattered signals, as they should guarantee good performance with a reasonably low number of antennas and low computational complexity, tightly related to the response time of the device. The paper shows that a PCA-based preprocessing algorithm can: (1) achieve good performance already with a computationally simple beamforming algorithm; (2) outperform state-of-the-art preprocessing algorithms; (3) enable a further improvement in the performance (and/or decrease in the number of antennas) by using a multistatic approach with just a modest increase in computational complexity. This is an important result toward the implementation of such a diagnostic device, which could play an important role in emergency scenarios.


Subjects
Diagnostic Imaging/methods , Stroke/diagnosis , Algorithms , Artifacts , Humans , Microwaves , Radar/instrumentation , Signal Processing, Computer-Assisted
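A minimal sketch of PCA-based preprocessing in this spirit: the dominant principal components across antenna channels capture clutter that is common to all antennas (skin reflection, antenna coupling), and subtracting the projection onto them leaves the weaker, channel-specific response. The function, the use of plain SVD, and the number of removed components are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def pca_clutter_removal(signals, n_components=1):
    """Remove clutter shared across radar channels via PCA.

    signals: (M, T) matrix, one backscattered time signal per antenna.
    Returns the (M, T) residual after subtracting the projection onto
    the first n_components principal directions of the channel ensemble.
    """
    mean = signals.mean(axis=0, keepdims=True)
    centered = signals - mean
    # principal directions of the channel ensemble via SVD
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]                 # (k, T) clutter subspace
    clutter = centered @ basis.T @ basis      # projection onto that subspace
    return centered - clutter
```

The appeal for a transportable device is that this needs only one SVD of an M-by-T matrix per acquisition, keeping the computational cost compatible with a short response time.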
16.
IEEE Trans Pattern Anal Mach Intell ; 38(6): 1070-83, 2016 06.
Article in English | MEDLINE | ID: mdl-26372209

ABSTRACT

Recently, head pose estimation (HPE) from low-resolution surveillance data has gained in importance. However, monocular and multi-view HPE approaches still work poorly under target motion, as facial appearance distorts owing to camera perspective and scale changes when a person moves around. To address this, we propose FEGA-MTL, a novel framework based on Multi-Task Learning (MTL) for classifying the head pose of a person who moves freely in an environment monitored by multiple, large field-of-view surveillance cameras. Upon partitioning the monitored scene into a dense uniform spatial grid, FEGA-MTL simultaneously clusters grid partitions into regions with similar facial appearance, while learning region-specific head pose classifiers. In the learning phase, guided by two graphs which a priori model the similarity among (1) grid partitions based on camera geometry and (2) head pose classes, FEGA-MTL derives the optimal scene partitioning and associated pose classifiers. Upon determining the target's position using a person tracker at test time, the corresponding region-specific classifier is invoked for HPE. The FEGA-MTL framework naturally extends to a weakly supervised setting where the target's walking direction is employed as a proxy in lieu of head orientation. Experiments confirm that FEGA-MTL significantly outperforms competing single-task and multi-task learning methods in multi-view settings.


Subjects
Algorithms , Head , Motion , Humans , Learning , Orientation
17.
IEEE Trans Pattern Anal Mach Intell ; 38(8): 1707-20, 2016 08.
Article in English | MEDLINE | ID: mdl-26540677

ABSTRACT

Studying free-standing conversational groups (FCGs) in unstructured social settings (e.g., cocktail party) is gratifying due to the wealth of information available at the group (mining social networks) and individual (recognizing native behavioral and personality traits) levels. However, analyzing social scenes involving FCGs is also highly challenging due to the difficulty in extracting behavioral cues such as target locations, their speaking activity and head/body pose, owing to crowdedness and the presence of extreme occlusions. To this end, we propose SALSA, a novel dataset facilitating multimodal and Synergetic sociAL Scene Analysis, and make two main contributions to research on automated social interaction analysis: (1) SALSA records social interactions among 18 participants in a natural, indoor environment for over 60 minutes, under poster-presentation and cocktail-party contexts, presenting difficulties in the form of low-resolution images, lighting variations, numerous occlusions, reverberations and interfering sound sources; (2) to alleviate these problems we facilitate multimodal analysis by recording the social interplay using four static surveillance cameras and sociometric badges worn by each participant, comprising microphone, accelerometer, Bluetooth, and infrared sensors. In addition to raw data, we also provide annotations concerning individuals' personality as well as their position, head, body orientation and F-formation information over the entire event duration. Through extensive experiments with state-of-the-art approaches, we show (a) the limitations of current methods and (b) how the recorded multiple cues synergetically aid automatic analysis of social interactions. SALSA is available at http://tev.fbk.eu/salsa.


Subjects
Algorithms , Datasets as Topic , Group Processes , Pattern Recognition, Automated , Social Behavior , Adult , Cues , Female , Humans , Interpersonal Relations , Lighting , Male , Video Recording , Young Adult
18.
IEEE Trans Image Process ; 24(10): 2984-95, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26067371

ABSTRACT

Recognizing human activities from videos is a fundamental research problem in computer vision. Recently, there has been a growing interest in analyzing human behavior from data collected with wearable cameras. First-person cameras continuously record several hours of their wearers' lives. To cope with this vast amount of unlabeled and heterogeneous data, novel algorithmic solutions are required. In this paper, we propose a multitask clustering framework for activities of daily living analysis from visual data gathered from wearable cameras. Our intuition is that, even if the data are not annotated, it is possible to exploit the fact that the tasks of recognizing everyday activities of multiple individuals are related, since typically people perform the same actions in similar environments (e.g., people working in an office often read and write documents). In our framework, rather than clustering data from different users separately, we propose to look for clustering partitions which are coherent among related tasks. In particular, two novel multitask clustering algorithms, derived from a common optimization problem, are introduced. Our experimental evaluation, conducted both on synthetic data and on publicly available first-person vision data sets, shows that the proposed approach outperforms several single-task and multitask learning methods.


Subjects
Actigraphy/methods , Activities of Daily Living/classification , Image Interpretation, Computer-Assisted/methods , Movement/physiology , Photography/methods , Video Recording/methods , Humans , Pattern Recognition, Automated/methods , Reproducibility of Results , Sensitivity and Specificity
19.
IEEE Trans Image Process ; 23(12): 5599-611, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25361507

ABSTRACT

Robust action recognition under viewpoint changes has received considerable attention recently. To this end, self-similarity matrices (SSMs) have been found to be effective view-invariant action descriptors. To enhance the performance of SSM-based methods, we propose multitask linear discriminant analysis (LDA), a novel multitask learning framework for multiview action recognition that allows for the sharing of discriminative SSM features among different views (i.e., tasks). Inspired by the mathematical connection between multivariate linear regression and LDA, we model multitask multiclass LDA as a single optimization problem by choosing an appropriate class indicator matrix. In particular, we propose two variants of graph-guided multitask LDA: 1) where the graph weights specifying view dependencies are fixed a priori and 2) where graph weights are flexibly learnt from the training data. We evaluate the proposed methods extensively on multiview RGB and RGBD video data sets, and experimental results confirm that the proposed approaches compare favorably with the state-of-the-art.

20.
IEEE Trans Pattern Anal Mach Intell ; 35(3): 513-26, 2013 Mar.
Article in English | MEDLINE | ID: mdl-22689076

ABSTRACT

In recent decades, many efforts have been devoted to developing methods for automatic scene understanding in the context of video surveillance applications. This paper presents a novel non-object-centric approach for complex scene analysis. Similarly to previous methods, we use low-level cues to identify atomic activities and create clip histograms. Differently from recent works, the task of discovering high-level activity patterns is formulated as a convex prototype learning problem. This problem results in a simple linear program that can be solved efficiently with standard solvers. The main advantage of our approach is that, by using the Earth Mover's Distance (EMD) as the objective function, the similarity among elementary activities is taken into account in the learning phase. To improve scalability we also consider some variants of EMD adopting L1 as the ground distance for 1D and 2D, linear and circular histograms. In these cases, only the similarity between neighboring atomic activities, corresponding to adjacent histogram bins, is taken into account. Therefore, we also propose an automatic strategy for sorting atomic activities. Experimental results on publicly available datasets show that our method compares favorably with state-of-the-art approaches, often outperforming them.
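For 1D histograms with L1 ground distance, the EMD in the objective reduces to a sum of absolute differences of cumulative histograms — this closed form is part of what keeps the resulting linear program cheap. A sketch of that reduction (bin layout and normalization are assumptions for illustration):

```python
import numpy as np

def emd_1d(h1, h2):
    """EMD between two 1D histograms with L1 (bin-index) ground distance.

    For linear 1D histograms the transportation problem has the closed form
    sum |CDF(h1) - CDF(h2)|, so no linear-program solver is needed.
    """
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    h1, h2 = h1 / h1.sum(), h2 / h2.sum()   # normalise to equal total mass
    return np.abs(np.cumsum(h1 - h2)).sum()
```

Note how moving all mass by two bins costs twice as much as moving it by one — exactly the neighbor-aware similarity between atomic activities that plain bin-wise distances ignore.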
