ABSTRACT
MoCA is a bi-modal dataset in which we collect Motion Capture data and video sequences, acquired from multiple views including an ego-like viewpoint, of upper-body actions in a cooking scenario. It has been collected with the specific purpose of investigating view-invariant action properties in both biological and artificial systems. It also represents an ideal test bed for research in a number of fields, including cognitive science and artificial vision, and for application domains such as motor control and robotics. Compared with other available benchmarks, MoCA provides a unique compromise for research communities that rely on very different approaches to data gathering: from action recognition in the wild, nowadays the standard practice in Computer Vision and Machine Learning, to motion analysis in very controlled scenarios, as in motor control for biomedical applications. In this work we introduce the dataset and its peculiarities, and discuss a baseline analysis as well as examples of applications for which the dataset is well suited.
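To fix ideas on how the two modalities could be handled together, the following is a minimal loading sketch in Python. The directory layout, file names, and viewpoint labels (lateral, frontal, ego) are illustrative assumptions and do not necessarily reflect how the actual MoCA release organizes its files.

```python
from pathlib import Path

import cv2          # OpenCV, used here only to open the video streams
import numpy as np

# Hypothetical layout: one folder per action containing a MoCap file and
# one video per viewpoint; names and formats are assumptions, not the
# official MoCA structure.
VIEWS = ("lateral", "frontal", "ego")

def load_action(root, action):
    """Return the MoCap trajectories and one video capture per view."""
    folder = Path(root) / action
    mocap = np.loadtxt(folder / "mocap.csv", delimiter=",")  # (T, n_markers * 3)
    videos = {view: cv2.VideoCapture(str(folder / f"{view}.mp4")) for view in VIEWS}
    return mocap, videos
```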
Subjects
Cooking/methods, Movement, Biomechanical Phenomena, Humans, Video Recording

ABSTRACT
In the industry of the future, as well as in healthcare and at home, robots will be a familiar presence. Since they will work closely with human operators who are not always properly trained for human-machine interaction, robots will need the ability to adapt automatically to changes in the task to be performed and to cope with variations in how the human partner completes it. The goal of this work is to take a further step toward endowing robots with this capability. To this purpose, we focus on the identification of relevant time instants in an observed action, called dynamic instants, which are informative about the partner's movement timing and mark the instants where an action starts, ends, or changes into another action. These instants are temporal locations where the motion can ideally be segmented, providing a set of primitives that can be used to build a temporal signature of the action and, ultimately, to support the understanding of its dynamics and coordination in time. We validate our approach in two contexts, considering first a situation in which the human partner can perform multiple different activities, and then moving to settings where the action has already been recognized and shows a certain degree of periodicity. The two contexts pose different challenges. In the first, working in batch on a dataset of videos of a variety of cooking activities, we investigate whether the computed action signature can help identify which type of action is occurring in front of the observer, with tolerance to viewpoint changes. In the second, we evaluate online, on the iCub robot, the capability of the action signature to provide cues for establishing actual temporal coordination during the interaction with human participants. In both cases we obtain promising results that speak in favor of the potential of our approach.
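The abstract does not detail the detector itself, but a natural way to approximate dynamic instants is to look for local minima of the effector speed, where the motion pauses or changes direction. The sketch below follows that assumption; the trajectory format, the smoothing window, and the minimum spacing between instants are illustrative parameters, not the method used in the paper.

```python
import numpy as np
from scipy.signal import find_peaks

def dynamic_instants(trajectory, fps=30.0, smooth=5):
    """Candidate dynamic instants as local minima of the effector speed.

    trajectory: (T, 3) array of 3D positions of a tracked effector
    (e.g. the wrist) sampled at `fps` frames per second.
    Returns frame indices where the speed reaches a local minimum,
    i.e. where the motion can tentatively be segmented into primitives.
    """
    velocity = np.gradient(trajectory, 1.0 / fps, axis=0)  # (T, 3) velocities
    speed = np.linalg.norm(velocity, axis=1)               # scalar speed profile
    kernel = np.ones(smooth) / smooth                      # light smoothing
    speed = np.convolve(speed, kernel, mode="same")
    # local minima of the speed are peaks of its negation; enforce a
    # minimum temporal distance between consecutive instants
    minima, _ = find_peaks(-speed, distance=max(1, int(0.2 * fps)))
    return minima
```

Consecutive instants delimit candidate motion primitives, and the sequence of their durations can serve as a simple temporal signature of the action.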
ABSTRACT
Shearlets are a relatively new directional multi-scale framework for signal analysis, which has been shown to be effective in enhancing signal discontinuities, such as edges and corners, at multiple scales even in the presence of a large amount of noise. In this paper, we consider blob-like features in the shearlet framework. We derive a measure that is very effective for blob detection and, based on this measure, we propose a blob detector and a keypoint descriptor whose combination outperforms state-of-the-art algorithms on noisy and compressed images. We also demonstrate that the measure satisfies the perfect scale invariance property in the continuous case. We evaluate the robustness of our algorithm to different types of degradation, including blur, compression artifacts, and Gaussian noise. Furthermore, we carry out a comparative analysis on benchmark data, focusing in particular on tolerance to noise and image compression.
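The measure proposed in the paper is defined in the shearlet domain; since no particular shearlet library is assumed here, the sketch below illustrates the same multi-scale blob-detection idea with the classical scale-normalized Laplacian of Gaussian instead. The scales and threshold are illustrative choices, not those of the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace, maximum_filter

def log_blobs(image, sigmas=(2, 4, 8, 16), rel_threshold=0.05):
    """Blob candidates as local maxima of |sigma^2 * LoG| over space and scale."""
    image = np.asarray(image, dtype=float)
    # one scale-normalized response map per sigma, stacked along axis 0
    responses = np.stack(
        [s ** 2 * np.abs(gaussian_laplace(image, s)) for s in sigmas]
    )
    # 3D non-maximum suppression over (scale, row, col)
    is_max = maximum_filter(responses, size=3) == responses
    keep = is_max & (responses > rel_threshold * responses.max())
    return [(r, c, sigmas[s]) for s, r, c in zip(*np.nonzero(keep))]
```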
ABSTRACT
In this paper, we propose a sparse coding approach to background modeling. The resulting model is based on dictionaries that we learn and keep up to date as new data are provided by a video camera. We observe that, in the absence of dynamic events, video frames may be seen as noisy data belonging to the background. Over time, this background is subject to local and global changes due to variable illumination conditions, camera jitter, stable scene changes, and intermittent motion of background objects. To capture the locality of some changes, we propose a space-variant analysis in which we learn a dictionary of atoms for each image patch, the size of which depends on the background variability. At run time, each patch is represented by a linear combination of the atoms learnt online. A change is detected when the atoms are not sufficient to provide an appropriate representation, while stable changes over time trigger an update of the current dictionary. Although the overall procedure is carried out at a coarse level, a pixel-wise segmentation can be obtained by comparing the atoms with the patch corresponding to the dynamic event. Experiments on benchmarks indicate that the proposed method achieves very good performance in a variety of scenarios. An assessment on long video streams confirms that our method incorporates periodic changes, such as those caused by variations in natural illumination. The model, being fully data-driven, is suitable as the main component of a change detection system.
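As a rough illustration of the per-patch model, the sketch below learns a small dictionary from background patches and flags a change when the sparse reconstruction error of a new patch is too large. It relies on scikit-learn's MiniBatchDictionaryLearning; the number of atoms, sparsity level, relative-error test, and threshold are assumptions rather than the paper's exact procedure, and the online dictionary update is omitted.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

class PatchBackgroundModel:
    """Sparse-coding background model for a single patch location."""

    def __init__(self, n_atoms=8, threshold=0.15):
        self.learner = MiniBatchDictionaryLearning(
            n_components=n_atoms, alpha=1.0, batch_size=16,
            transform_algorithm="omp", transform_n_nonzero_coefs=3,
        )
        self.threshold = threshold

    def fit(self, background_patches):
        # background_patches: (n_samples, patch_height * patch_width),
        # flattened patches taken from frames without dynamic events
        self.learner.fit(background_patches)

    def is_change(self, patch):
        # represent the patch as a sparse combination of the learnt atoms
        x = patch.reshape(1, -1).astype(float)
        code = self.learner.transform(x)
        reconstruction = code @ self.learner.components_
        # a change is declared when the atoms cannot represent the patch well
        error = np.linalg.norm(x - reconstruction) / (np.linalg.norm(x) + 1e-8)
        return error > self.threshold
```

In a full system one such model would be kept per patch location, and patches with persistently high error could be fed back to the learner (e.g. via partial_fit) to absorb stable scene changes.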