Search | BVS CLAP/SMR-OPAS/OMS

A Survey of Self-Supervised and Few-Shot Object Detection.

Huang, Gabriel; Laradji, Issam; Vazquez, David; Lacoste-Julien, Simon; Rodriguez, Pau.

IEEE Trans Pattern Anal Mach Intell ; 45(4): 4071-4089, 2023 Apr.

Article in English | MEDLINE | ID: mdl-35976841

ABSTRACT

Labeling data is often expensive and time-consuming, especially for tasks such as object detection and instance segmentation, which require dense labeling of the image. While few-shot object detection is about training a model on novel (unseen) object classes with little data, it still requires prior training on many labeled examples of base (seen) classes. On the other hand, self-supervised methods aim at learning representations from unlabeled data which transfer well to downstream tasks such as object detection. Combining few-shot and self-supervised object detection is a promising research direction. In this survey, we review and characterize the most recent approaches on few-shot and self-supervised object detection. Then, we give our main takeaways and discuss future research directions. Project page: https://gabrielhuang.github.io/fsod-survey/.

Scattering Networks for Hybrid Representation Learning.

Oyallon, Edouard; Zagoruyko, Sergey; Huang, Gabriel; Komodakis, Nikos; Lacoste-Julien, Simon; Blaschko, Matthew; Belilovsky, Eugene.

IEEE Trans Pattern Anal Mach Intell ; 41(9): 2208-2221, 2019 Sep.

Article in English | MEDLINE | ID: mdl-30028690

ABSTRACT

Scattering networks are a class of designed Convolutional Neural Networks (CNNs) with fixed weights. We argue they can serve as generic representations for modelling images. In particular, by working in scattering space, we achieve competitive results both for supervised and unsupervised learning tasks, while making progress towards constructing more interpretable CNNs. For supervised learning, we demonstrate that the early layers of CNNs do not necessarily need to be learned, and can be replaced with a scattering network instead. Indeed, using hybrid architectures, we achieve the best results with predefined representations to date, while being competitive with end-to-end learned CNNs. Specifically, even applying a shallow cascade of small-windowed scattering coefficients followed by $1\times 1$ convolutions results in AlexNet accuracy on the ILSVRC2012 classification task. Moreover, by combining scattering networks with deep residual networks, we achieve a single-crop top-5 error of 11.4 percent on ILSVRC2012. Also, we show they can yield excellent performance in the small sample regime on CIFAR-10 and STL-10 datasets, exceeding their end-to-end counterparts, through their ability to incorporate geometrical priors. For unsupervised learning, scattering coefficients can be a competitive representation that permits image recovery. We use this fact to train hybrid GANs to generate images. Finally, we empirically analyze several properties related to stability and reconstruction of images from scattering coefficients.

Learning from Narrated Instruction Videos.

Alayrac, Jean-Baptiste; Bojanowski, Piotr; Agrawal, Nishant; Sivic, Josef; Laptev, Ivan; Lacoste-Julien, Simon.

IEEE Trans Pattern Anal Mach Intell ; 40(9): 2194-2208, 2018 Sep.

Article in English | MEDLINE | ID: mdl-28885149

ABSTRACT

Automatic assistants could guide a person or a robot in performing new tasks, such as changing a car tire or repotting a plant. Creating such assistants, however, is non-trivial and requires understanding of visual and verbal content of a video. Towards this goal, we here address the problem of automatically learning the main steps of a task from a set of narrated instruction videos. We develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method sequentially clusters textual and visual representations of a task, where the two clustering problems are linked by joint constraints to obtain a single coherent sequence of steps in both modalities. To evaluate our method, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains videos for five different tasks with complex interactions between people and objects, captured in a variety of indoor and outdoor settings. We experimentally demonstrate that the proposed method can automatically discover, learn and localize the main steps of a task in input videos.

Subjects

Internet, Natural Language Processing, Unsupervised Machine Learning, Video Recording, Cluster Analysis, Factual Databases, Information Dissemination, Narration
