Search | BVS Bolivia

Text-Based Localization of Moments in a Video Corpus.

Paul, Sudipta; Mithun, Niluthpol Chowdhury; Roy-Chowdhury, Amit K.

IEEE Trans Image Process ; 30: 8886-8899, 2021.

Article in English | MEDLINE | ID: mdl-34665727

ABSTRACT

Prior works on text-based video moment localization focus on temporally grounding the textual query in an untrimmed video. These works assume that the relevant video is already known and attempt to localize the moment in that video only. In contrast, we relax this assumption and address the task of localizing moments in a corpus of videos for a given sentence query. This task poses a unique challenge, as the system is required to perform: 1) retrieval of the relevant video, where only a segment of the video corresponds to the queried sentence, and 2) temporal localization of the moment in the relevant video based on the sentence query. To overcome this challenge, we propose the Hierarchical Moment Alignment Network (HMAN), which learns an effective joint embedding space for moments and sentences. In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries. Qualitative and quantitative results on three benchmark text-based video moment retrieval datasets (Charades-STA, DiDeMo, and ActivityNet Captions) demonstrate that our method achieves promising performance on the proposed task of temporal localization of moments in a corpus of videos.
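The two steps named in the abstract (video retrieval, then moment localization) can be illustrated with a minimal sketch. It assumes that moments and the sentence query have already been mapped into a shared embedding space; it does not reproduce HMAN itself. The function names, the corpus layout, and the use of cosine similarity are hypothetical choices made for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_video(query_vec, corpus):
    """Step 1 (assumed scoring rule): rank videos by their best-matching moment."""
    return max(
        corpus,
        key=lambda vid: max(cosine(query_vec, emb) for _, emb in corpus[vid]),
    )

def localize_moment(query_vec, moments):
    """Step 2: pick the (start, end) span whose embedding best matches the query."""
    span, _ = max(moments, key=lambda m: cosine(query_vec, m[1]))
    return span

def localize_in_corpus(query_vec, corpus):
    """corpus maps video_id -> list of ((start_sec, end_sec), embedding)."""
    video_id = retrieve_video(query_vec, corpus)
    return video_id, localize_moment(query_vec, corpus[video_id])

# Toy corpus with hand-made 2-D embeddings (hypothetical data).
corpus = {
    "v1": [((0.0, 5.0), [1.0, 0.0]), ((5.0, 10.0), [0.9, 0.1])],
    "v2": [((0.0, 5.0), [0.5, 0.5]), ((5.0, 10.0), [0.0, 1.0])],
}
result = localize_in_corpus([0.0, 1.0], corpus)
```

Scoring a video by its best moment reflects the point made in the abstract that only a segment of the relevant video corresponds to the query; other aggregation rules are equally possible.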

Long-Range Augmented Reality with Dynamic Occlusion Rendering.

Sizintsev, Mikhail; Mithun, Niluthpol Chowdhury; Chiu, Han-Pang; Samarasekera, Supun; Kumar, Rakesh.

IEEE Trans Vis Comput Graph ; 27(11): 4236-4244, 2021 Nov.

Article in English | MEDLINE | ID: mdl-34449369

ABSTRACT

Proper occlusion-based rendering is very important for achieving realism in all indoor and outdoor Augmented Reality (AR) applications. This paper addresses the problem of fast and accurate dynamic occlusion reasoning by real objects in the scene for large-scale outdoor AR applications. Conceptually, proper occlusion reasoning requires an estimate of depth for every point in the augmented scene, which is technically hard to achieve for outdoor scenarios, especially in the presence of moving objects. We propose a method to detect and automatically infer the depth of real objects in the scene without explicit detailed scene modeling and depth sensing (e.g., without using sensors such as 3D-LiDAR). Specifically, we employ instance segmentation of color image data to detect real dynamic objects in the scene and use either a top-down terrain elevation model or a deep-learning-based monocular depth estimation model to infer their metric distance from the camera for proper occlusion reasoning in real time. The realized solution is implemented in a low-latency real-time framework for video-see-through AR and is directly extendable to optical-see-through AR. We minimize latency in depth reasoning and occlusion rendering by performing semantic object tracking and prediction in video frames.
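The occlusion test described in the abstract can be sketched as a per-pixel depth comparison. This is a simplified illustration, not the authors' pipeline: it assumes an instance mask and a single metric distance for one detected real object are already available, and the function name and image layout (nested lists, `None` for pixels without virtual content) are hypothetical.

```python
def composite_with_occlusion(camera, virtual, virtual_depth,
                             object_mask, object_distance):
    """Overlay virtual content on a camera frame, hiding virtual pixels
    that lie behind a detected real object.

    camera, virtual, virtual_depth, object_mask: 2-D lists of equal shape.
    virtual[y][x] is None where there is no virtual content.
    object_distance: assumed metric distance of the real object (meters).
    """
    out = []
    for y, row in enumerate(camera):
        out_row = []
        for x, cam_px in enumerate(row):
            virt_px = virtual[y][x]
            if virt_px is None:
                # Nothing to render here: keep the camera pixel.
                out_row.append(cam_px)
            elif object_mask[y][x] and object_distance < virtual_depth[y][x]:
                # The real object is closer than the virtual one: it occludes.
                out_row.append(cam_px)
            else:
                out_row.append(virt_px)
        out.append(out_row)
    return out

# Toy 2x2 frame: "c" is a camera pixel, "v" a virtual pixel (hypothetical data).
camera = [["c", "c"], ["c", "c"]]
virtual = [["v", None], ["v", "v"]]
virtual_depth = [[10.0, 0.0], [10.0, 10.0]]
object_mask = [[True, True], [False, False]]

near = composite_with_occlusion(camera, virtual, virtual_depth, object_mask, 5.0)
far = composite_with_occlusion(camera, virtual, virtual_depth, object_mask, 20.0)
```

With the real object at 5 m, the masked virtual pixel at 10 m is hidden; with the object at 20 m, the same virtual pixel stays visible.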

Diversity-Aware Multi-Video Summarization.

Panda, Rameswar; Mithun, Niluthpol Chowdhury; Roy-Chowdhury, Amit K.

IEEE Trans Image Process ; 26(10): 4712-4724, 2017 Oct.

Article in English | MEDLINE | ID: mdl-28574359

ABSTRACT

Most video summarization approaches have focused on extracting a summary from a single video; we propose an unsupervised framework for summarizing a collection of videos. We observe that each video in the collection may contain some information that other videos do not have, and thus exploring the underlying complementarity could be beneficial in creating a diverse, informative summary. We develop a novel diversity-aware sparse optimization method for multi-video summarization by exploring the complementarity within the videos. Our approach extracts a multi-video summary that is both interesting and representative in describing the whole video collection. To efficiently solve our optimization problem, we develop an alternating minimization algorithm that minimizes the overall objective function with respect to one video at a time while fixing the other videos. Moreover, we introduce a new benchmark data set, Tour20, that contains 140 videos with multiple manually created summaries, which were acquired in a controlled experiment. Finally, through extensive experiments on the new Tour20 data set and several other multi-view data sets, we show that the proposed approach clearly outperforms the state-of-the-art methods on two problems: topic-oriented video summarization and multi-view video summarization in a camera network.
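The alternating scheme mentioned in the abstract (update one video's variables while the others stay fixed) can be illustrated on a toy problem. The objective below is an assumption chosen for clarity, not the paper's sparse optimization: each video has a single scalar variable, a quadratic fit term pulls it toward a target, and a pairwise product term with weight `lam` stands in for a penalty on redundancy between videos. All names are hypothetical.

```python
def objective(x, targets, lam):
    """Toy objective: sum_i (x_i - t_i)^2 + lam * sum_{i<j} x_i * x_j."""
    fit = sum((xi - ti) ** 2 for xi, ti in zip(x, targets))
    overlap = sum(x[i] * x[j]
                  for i in range(len(x)) for j in range(i + 1, len(x)))
    return fit + lam * overlap

def alternating_minimization(targets, lam, sweeps=100):
    """Minimize the toy objective one block (video) at a time.

    With the other blocks fixed, setting the derivative with respect to
    x_i to zero gives x_i = t_i - (lam / 2) * sum_{j != i} x_j.
    The toy objective is convex for 0 <= lam < 2.
    """
    x = list(targets)
    history = [objective(x, targets, lam)]
    for _ in range(sweeps):
        for i in range(len(x)):
            others = sum(x) - x[i]          # contribution of the fixed blocks
            x[i] = targets[i] - 0.5 * lam * others
        history.append(objective(x, targets, lam))
    return x, history

x, history = alternating_minimization([1.0, 2.0, 3.0], lam=0.5)
```

Because each block update is an exact minimization over that block, the recorded objective values never increase from one sweep to the next, which is the property that makes this style of solver attractive for coupled multi-video objectives.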
