ABSTRACT
We present Neural Feature Fusion Fields (N3F), a method that improves dense 2D image feature extractors when the latter are applied to the analysis of multiple images reconstructible as a 3D scene. Given an image feature extractor, for example pre-trained using self-supervision, N3F uses it as a teacher to learn a student network defined in 3D space. The 3D student network is similar to a neural radiance field that distills said features and can be trained with the usual differentiable rendering machinery. As a consequence, N3F is readily applicable to most neural rendering formulations, including vanilla NeRF and its extensions to complex dynamic scenes. We show that our method not only enables semantic understanding in the context of scene-specific neural fields without the use of manual labels, but also consistently improves over the self-supervised 2D baselines. This is demonstrated by considering various tasks, such as 2D object retrieval, 3D segmentation, and scene editing, in diverse sequences, including long egocentric videos in the EPIC-KITCHENS benchmark. Project page: https://www.robots.ox.ac.uk/~vadim/n3f/.
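To make the distillation step concrete, the sketch below (PyTorch, hypothetical variable names, not the authors' code) shows a NeRF-style field that outputs a feature vector alongside density, with rendered features supervised by a frozen 2D teacher through an L2 loss:

```python
# Minimal sketch of the feature-distillation idea behind N3F (not the authors' code).
# Assumes a NeRF-like field that, in addition to density, outputs a feature vector
# per 3D point; rendered features are supervised with 2D teacher features.
import torch
import torch.nn as nn

class FeatureField(nn.Module):
    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)         # sigma(x)
        self.feature_head = nn.Linear(hidden, feat_dim)  # f(x), distilled from the teacher

    def forward(self, points):                           # points: (num_rays, num_samples, 3)
        h = self.mlp(points)
        return self.density_head(h).squeeze(-1), self.feature_head(h)

def render_features(sigma, feats, deltas):
    """Standard volume rendering, applied to features instead of colors."""
    alpha = 1.0 - torch.exp(-torch.relu(sigma) * deltas)                  # (R, S)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                                               # (R, S)
    return (weights.unsqueeze(-1) * feats).sum(dim=1)                     # (R, feat_dim)

# Hypothetical training step: `teacher_feats` stand in for frozen 2D features sampled
# at the pixels through which the rays pass; all shapes below are illustrative.
field = FeatureField()
points = torch.rand(1024, 64, 3)          # sampled 3D points along 1024 rays
deltas = torch.full((1024, 64), 0.01)     # distances between consecutive samples
teacher_feats = torch.randn(1024, 64)     # teacher features at the corresponding pixels

sigma, feats = field(points)
loss = ((render_features(sigma, feats, deltas) - teacher_feats) ** 2).mean()
loss.backward()
```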
ABSTRACT
We propose an unsupervised method to learn the 3D geometry of object categories by looking around them. Differently from traditional approaches, this method does not require CAD models or manual supervision. Instead, using only video sequences showing object instances from a moving viewpoint, the method learns a deep neural network that can predict several aspects of the 3D geometry of such objects from single images. The network has three components. The first is a Siamese viewpoint factorization network that robustly aligns the input videos and learns to predict the absolute viewpoint of the object from a single image. The second is a depth estimation network that performs monocular depth prediction. The third is a shape completion network that predicts the full 3D shape of the object from the output of the monocular depth prediction module. While the three modules solve very different tasks, we show that they all benefit significantly from allowing networks to perform probabilistic predictions. This results in a self-assessment mechanism which is crucial for obtaining high quality predictions. Our network achieves state-of-the-art results on viewpoint prediction, depth estimation, and 3D point cloud estimation on public benchmarks.
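The key ingredient above, probabilistic prediction with self-assessment, can be illustrated with a regression head that outputs a mean and a log-variance and is trained under a Gaussian negative log-likelihood; this is a generic sketch, not the paper's exact parameterization:

```python
# Minimal sketch of probabilistic prediction with self-assessment (not the authors'
# implementation): each regression head outputs a mean and a log-variance, trained
# with a Gaussian negative log-likelihood, so the predicted variance serves as a
# self-assessed confidence in the prediction.
import torch
import torch.nn as nn

class ProbabilisticHead(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mean = nn.Linear(in_dim, out_dim)
        self.log_var = nn.Linear(in_dim, out_dim)

    def forward(self, h):
        return self.mean(h), self.log_var(h)

def gaussian_nll(mean, log_var, target):
    # 0.5 * (log sigma^2 + (y - mu)^2 / sigma^2), constant terms dropped
    return 0.5 * (log_var + (target - mean) ** 2 * torch.exp(-log_var)).mean()

# Toy usage with a hypothetical depth-regression head on backbone features.
features = torch.randn(8, 128)
target_depth = torch.rand(8, 1)
head = ProbabilisticHead(128, 1)
mean, log_var = head(features)
loss = gaussian_nll(mean, log_var, target_depth)
loss.backward()
confidence = torch.exp(-log_var)   # high value = the network trusts its own output
```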
ABSTRACT
Color names are required in real-world applications such as image retrieval and image annotation. Traditionally, they are learned from a collection of labeled color chips. These color chips are labeled with color names within a well-defined experimental setup by human test subjects. However, naming colors in real-world images differs significantly from this experimental setting. In this paper, we investigate how color names learned from color chips compare to color names learned from real-world images. To avoid hand labeling real-world images with color names, we use Google Image to collect a data set. Due to the limitations of Google Image, this data set contains a substantial quantity of wrongly labeled data. We propose several variants of the PLSA model to learn color names from this noisy data. Experimental results show that color names learned from real-world images significantly outperform color names learned from labeled color chips for both image retrieval and image annotation.
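As a reference point for the proposed variants, the sketch below implements standard PLSA with EM on an image-by-color-word count matrix; the paper's adaptations for noisy Google Image labels are not reproduced here, and all names are illustrative:

```python
# Minimal sketch of standard PLSA EM on an image/word count matrix, to illustrate the
# kind of model the proposed variants build on. Images play the role of documents,
# discretized pixel colors the role of words, and color names the role of topics.
import numpy as np

def plsa(counts, num_topics, num_iters=50, seed=0):
    """counts: (num_docs, num_words) matrix of word counts per document."""
    rng = np.random.default_rng(seed)
    num_docs, num_words = counts.shape
    p_w_given_z = rng.random((num_topics, num_words))
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = rng.random((num_docs, num_topics))
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

    for _ in range(num_iters):
        # E-step: P(z | d, w) for every (d, w) pair, shape (docs, words, topics)
        joint = p_z_given_d[:, None, :] * p_w_given_z.T[None, :, :]
        p_z_given_dw = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)

        # M-step: re-estimate P(w | z) and P(z | d) from expected counts
        expected = counts[:, :, None] * p_z_given_dw           # (docs, words, topics)
        p_w_given_z = expected.sum(axis=0).T                    # (topics, words)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_given_d = expected.sum(axis=1)                      # (docs, topics)
        p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True) + 1e-12

    return p_w_given_z, p_z_given_d

# Toy usage: 20 images, 64 color bins, 11 basic color names.
counts = np.random.default_rng(1).integers(0, 5, size=(20, 64)).astype(float)
p_w_given_z, p_z_given_d = plsa(counts, num_topics=11)
```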
ABSTRACT
This article deals with the detection of prominent objects in images. As opposed to the standard approaches based on sliding windows, we study a fundamentally different solution by formulating the supervised prediction of a bounding box as an image retrieval task. Indeed, given a global image descriptor, we find the most similar images in an annotated dataset, and transfer the object bounding boxes. We refer to this approach as data-driven detection (DDD). Our key novelty is to design or learn image similarities that explicitly optimize some aspect of the transfer, unlike previous work which uses generic representations and unsupervised similarities. In a first variant, we explicitly learn to transfer, by adapting a metric learning approach to work with image and bounding box pairs. Second, we use a representation of images as object probability maps computed from low-level patch classifiers. Experiments show that these two contributions yield results that are in some cases comparable to or better than those of standard sliding window detectors, despite the approach's conceptual simplicity and run-time efficiency. Our third contribution is an application of prominent object detection, where we improve fine-grained categorization by pre-cropping images with the proposed approach. Finally, we also extend the proposed approach to detect multiple parts of rigid objects.
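The baseline retrieval-and-transfer pipeline underlying DDD can be summarized as below; for illustration, the learned similarity and object probability maps from the paper are replaced by a plain cosine similarity over hypothetical global descriptors:

```python
# Minimal sketch of retrieval-and-transfer detection (not the paper's learned
# similarity): find the most similar annotated images under a generic global
# descriptor and transfer the median of their bounding boxes.
import numpy as np

def ddd_predict_box(query_desc, db_descs, db_boxes, k=5):
    """query_desc: (d,), db_descs: (n, d), db_boxes: (n, 4) normalized [x1, y1, x2, y2]."""
    # Cosine similarity between the query and every annotated image
    q = query_desc / (np.linalg.norm(query_desc) + 1e-12)
    db = db_descs / (np.linalg.norm(db_descs, axis=1, keepdims=True) + 1e-12)
    sims = db @ q
    neighbors = np.argsort(-sims)[:k]
    # Transfer step: coordinate-wise median of the neighbors' boxes
    return np.median(db_boxes[neighbors], axis=0)

# Toy usage with random descriptors; a metric-learning variant would replace the
# cosine similarity with one optimized for bounding-box transfer.
rng = np.random.default_rng(0)
db_descs = rng.standard_normal((100, 512))
db_boxes = rng.random((100, 4))
query = rng.standard_normal(512)
print(ddd_predict_box(query, db_descs, db_boxes))
```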
ABSTRACT
This paper considers scalable and unobtrusive activity recognition using on-body sensing for context awareness in wearable computing. Common methods for activity recognition rely on supervised learning requiring substantial amounts of labeled training data. Obtaining accurate and detailed annotations of activities is challenging, preventing the applicability of these approaches in real-world settings. This paper proposes new annotation strategies that substantially reduce the required amount of annotation. We explore two learning schemes for activity recognition that effectively leverage such sparsely labeled data together with more easily obtainable unlabeled data. Experimental results on two public data sets indicate that both approaches obtain results close to fully supervised techniques. The proposed methods are robust to the presence of erroneous labels occurring in real-world annotation data.
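One way to leverage sparsely labeled windows together with unlabeled data, in the spirit of the schemes discussed above, is self-training with pseudo-labels; the sketch below is a generic illustration under that assumption, not the exact algorithms evaluated in the paper:

```python
# Minimal self-training sketch for sparsely labeled activity data (one of several
# possible semi-supervised schemes). Confidently classified unlabeled windows are
# added to the training set with their predicted (pseudo) labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(x_labeled, y_labeled, x_unlabeled, rounds=5, threshold=0.9):
    x_train, y_train = x_labeled.copy(), y_labeled.copy()
    pool = x_unlabeled.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(x_train, y_train)
        if len(pool) == 0:
            break
        probs = clf.predict_proba(pool)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():
            break
        # Move confidently pseudo-labeled windows into the training set
        pseudo = clf.classes_[probs[confident].argmax(axis=1)]
        x_train = np.vstack([x_train, pool[confident]])
        y_train = np.concatenate([y_train, pseudo])
        pool = pool[~confident]
    return clf

# Toy usage: 20 labeled and 200 unlabeled feature windows, 3 activity classes.
rng = np.random.default_rng(0)
x_l = rng.standard_normal((20, 10)); y_l = rng.integers(0, 3, 20)
x_u = rng.standard_normal((200, 10))
model = self_train(x_l, y_l, x_u)
```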