Results 1 - 20 of 43
1.
IEEE Trans Pattern Anal Mach Intell ; 45(6): 6766-6782, 2023 Jun.
Article in English | MEDLINE | ID: mdl-34232862

ABSTRACT

With the increasing social demand for disaster response, methods of visual observation for rescue and safety have become increasingly important. However, because of the shortage of datasets for disaster scenarios, there has been little progress in computer vision and robotics in this field. With this in mind, we present the first large-scale synthetic dataset of egocentric viewpoints for disaster scenarios. We simulate pre- and post-disaster cases with drastic changes in appearance, such as buildings on fire and earthquakes. The dataset consists of more than 300K high-resolution stereo image pairs, all annotated with ground-truth data for semantic labels, depth in metric scale, optical flow with sub-pixel precision, and surface normals, as well as their corresponding camera poses. To create realistic disaster scenes, we manually augment the effects with 3D models using physically-based graphics tools. We train various state-of-the-art methods to perform computer vision tasks using our dataset, evaluate how well these methods recognize disaster situations, and show that they produce reliable results on virtual scenes as well as real-world images. We also present a convolutional neural network-based egocentric localization method that is robust to drastic appearance changes, such as texture changes in a fire or layout changes from a collapse. To address these key challenges, we propose a new model that learns a shape-based representation by training on stylized images, and incorporates the dominant planes of query images as approximate scene coordinates. We evaluate the proposed method on various scenes, including a simulated disaster dataset, to demonstrate its effectiveness when confronted with significant changes in scene layout. Experimental results show that our method provides reliable camera pose predictions despite vastly changed conditions.

2.
IEEE Trans Neural Netw Learn Syst ; 34(11): 8753-8763, 2023 Nov.
Article in English | MEDLINE | ID: mdl-35316194

ABSTRACT

Recent state-of-the-art active learning methods have mostly leveraged generative adversarial networks (GANs) for sample acquisition; however, GANs are known to suffer from instability and sensitivity to hyperparameters. In contrast to these methods, in this article we propose a novel active learning framework, Maximum Classifier Discrepancy for Active Learning (MCDAL), that exploits the prediction discrepancies between multiple classifiers. In particular, we utilize two auxiliary classification layers that learn tighter decision boundaries by maximizing the discrepancies between them. Intuitively, discrepancies between the auxiliary classification layers' predictions indicate uncertainty in the prediction. In this regard, we propose a novel acquisition function for active learning that leverages the classifier discrepancies. We also provide an interpretation of our idea in relation to existing GAN-based active learning methods and domain adaptation frameworks. Moreover, we empirically demonstrate the utility of our approach, which exceeds state-of-the-art performance on several image classification and semantic segmentation datasets in active learning setups.
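A minimal PyTorch sketch of a discrepancy-based acquisition step, assuming a shared backbone and two auxiliary heads; the names (`backbone`, `head_a`, `head_b`, `unlabeled_loader`) are hypothetical, and the paper's exact scoring and training losses may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def discrepancy_scores(backbone, head_a, head_b, unlabeled_loader, device="cpu"):
    """Score unlabeled samples by the L1 gap between the two auxiliary
    heads' softmax outputs (a proxy for predictive uncertainty)."""
    backbone.eval(); head_a.eval(); head_b.eval()
    scores = []
    for x, _ in unlabeled_loader:
        feats = backbone(x.to(device))
        p_a = F.softmax(head_a(feats), dim=1)
        p_b = F.softmax(head_b(feats), dim=1)
        scores.append((p_a - p_b).abs().sum(dim=1).cpu())
    return torch.cat(scores)

# Acquisition: send the k most contested samples to the oracle.
# query_idx = discrepancy_scores(backbone, head_a, head_b, loader).topk(100).indices
```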

3.
Sensors (Basel) ; 22(19)2022 Sep 28.
Article in English | MEDLINE | ID: mdl-36236485

ABSTRACT

Depth perception capability is one of the essential requirements for various autonomous driving platforms. However, accurate depth estimation in a real-world setting is still a challenging problem due to high computational costs. In this paper, we propose a lightweight depth completion network for depth perception in real-world environments. To effectively transfer a teacher's knowledge that is useful for depth completion, we introduce local similarity-preserving knowledge distillation (LSPKD), which allows similarities between local neighbors to be transferred during distillation. With our LSPKD, a lightweight student network is precisely guided by a heavy teacher network, regardless of the density of the ground-truth data. Experimental results demonstrate that our method effectively reduces computational costs during both training and inference stages while achieving superior performance over other lightweight networks.
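A hedged PyTorch sketch of a local similarity-preserving distillation loss: each pixel's cosine similarity to its k x k neighbours is computed in both feature maps and matched. The window size and the MSE matching are assumptions, not the paper's exact formulation; since only similarity patterns are compared, student and teacher may have different channel widths, though spatial sizes must agree.

```python
import torch
import torch.nn.functional as F

def local_similarity_kd_loss(f_student, f_teacher, k=3):
    """Match each pixel's cosine similarities to its k x k neighbours
    between student and teacher feature maps of shape (B, C, H, W)."""
    def local_sim(f):
        f = F.normalize(f, dim=1)                     # unit-norm channels
        patches = F.unfold(f, k, padding=k // 2)      # B, C*k*k, H*W
        b, c = f.shape[:2]
        patches = patches.view(b, c, k * k, -1)
        center = f.flatten(2).unsqueeze(2)            # B, C, 1, H*W
        return (patches * center).sum(dim=1)          # B, k*k, H*W cosines
    return F.mse_loss(local_sim(f_student), local_sim(f_teacher))
```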


Subjects
Algorithms, Humans
4.
IEEE Trans Image Process ; 31: 5383-5395, 2022.
Article in English | MEDLINE | ID: mdl-35749323

ABSTRACT

A holistic understanding of dynamic scenes is of fundamental importance in real-world computer vision problems such as autonomous driving, augmented reality, and spatio-temporal reasoning. In this paper, we propose a new computer vision benchmark: Video Panoptic Segmentation (VPS). To study this important problem, we present two datasets, Cityscapes-VPS and VIPER, together with a new evaluation metric, video panoptic quality (VPQ). We also propose VPSNet++, an advanced video panoptic segmentation network, which simultaneously performs classification, detection, segmentation, and tracking of all identities in videos. Specifically, VPSNet++ builds upon a top-down panoptic segmentation network by adding a pixel-level feature fusion head and an object-level association head. The former temporally augments the pixel features, while the latter performs object tracking. Furthermore, we propose panoptic boundary learning as an auxiliary task, and instance discrimination learning, which learns spatio-temporally clustered pixel embeddings for individual thing or stuff regions, i.e., exactly the objective of the video panoptic segmentation problem. Our VPSNet++ significantly outperforms the default VPSNet, i.e., the FuseTrack baseline, and achieves state-of-the-art results on both the Cityscapes-VPS and VIPER datasets. The datasets, metric, and models are publicly available at https://github.com/mcahny/vps.
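For orientation, a back-of-the-envelope sketch of the PQ-style score underlying VPQ, assuming segment matching (IoU > 0.5) has already been performed; VPQ applies this to segment tubes over temporal windows and averages across window sizes. The function name and inputs are illustrative.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """PQ-style score: tp_ious holds the IoU of each true-positive match
    (IoU > 0.5); false positives and false negatives count half each."""
    tp = len(tp_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(tp_ious) / denom if denom else 0.0

# Example: two matched tubes, one spurious tube, one missed tube.
# vpq_window = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)  # ~0.467
```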

5.
IEEE Trans Pattern Anal Mach Intell ; 44(11): 8403-8419, 2022 Nov.
Article in English | MEDLINE | ID: mdl-34428135

ABSTRACT

We propose a new linear RGB-D simultaneous localization and mapping (SLAM) formulation that utilizes the planar features of structured environments. The key idea is to understand a given structured scene and exploit its structural regularities, such as the Manhattan world. This understanding allows us to decouple the camera rotation by tracking structural regularities, which frees the SLAM problem from being highly nonlinear. Additionally, it provides a simple yet effective cue for representing planar features, which leads to a linear SLAM formulation. Given an accurate camera rotation, we jointly estimate the camera translation and planar landmarks in the global planar map using a linear Kalman filter. Our linear SLAM method, called L-SLAM, can handle not only the Manhattan world but also the more general Atlanta world, which consists of a vertical direction and a set of horizontal directions orthogonal to it. To this end, we introduce a novel tracking-by-detection scheme that infers the underlying scene structure via the Atlanta representation. With this efficient representation, we formulate a unified linear SLAM framework for structured environments. We evaluate L-SLAM on a synthetic dataset and RGB-D benchmarks, demonstrating performance comparable to other state-of-the-art SLAM methods without expensive nonlinear optimization. We also assess the accuracy of L-SLAM in a practical augmented reality application.
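For illustration, the textbook linear Kalman filter predict/update pair in NumPy: once rotation is decoupled, a state stacking the camera translation and plane parameters can be estimated with exactly these linear equations. The state layout and noise models here are assumptions, not L-SLAM's published configuration.

```python
import numpy as np

def kf_predict(x, P, F_mat, Q):
    """Linear prediction step; in an L-SLAM-style formulation the state x
    stacks the camera translation and planar-landmark parameters."""
    x = F_mat @ x
    P = F_mat @ P @ F_mat.T + Q
    return x, P

def kf_update(x, P, z, H, R):
    """Standard linear measurement update -- no iterative optimization."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P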

6.
IEEE Trans Pattern Anal Mach Intell ; 44(11): 7348-7362, 2022 Nov.
Article in English | MEDLINE | ID: mdl-34648432

ABSTRACT

We introduce dense relational captioning, a novel image captioning task which aims to generate multiple captions with respect to relational information between objects in a visual scene. Relational captioning provides explicit descriptions of each relationship between object combinations. This framework is advantageous in both the diversity and the amount of information, leading to a comprehensive image understanding based on relationships, e.g., relational proposal generation. For relational understanding between objects, the part of speech (POS; i.e., subject-object-predicate categories) can provide valuable prior information to guide the causal sequence of words in a caption. We train our framework not only to generate captions but also to understand the POS of each word. To this end, we propose the multi-task triple-stream network (MTTSNet), which consists of three recurrent units, one responsible for each POS category, trained by jointly predicting the correct caption and the POS of each word. In addition, we found that the performance of MTTSNet can be improved by modulating the object embeddings with an explicit relational module. We demonstrate that our proposed model can generate more diverse and richer captions through extensive experimental analysis on large-scale datasets and several metrics. We then present applications of our framework to holistic image captioning, scene graph generation, and retrieval tasks.

7.
IEEE Trans Pattern Anal Mach Intell ; 44(9): 5460-5471, 2022 Sep.
Article in English | MEDLINE | ID: mdl-34057889

ABSTRACT

Taking selfies has become one of the major photographic trends of our time. In this study, we focus on the selfie stick, on which a camera is mounted to take selfies. We observe that a camera on a selfie stick typically travels along a particular type of trajectory around a sphere. Based on this finding, we propose a robust, efficient, and optimal estimation method for the relative camera pose between two images captured by a camera mounted on a selfie stick. We exploit the special geometric structure of camera motion constrained by a selfie stick and define this motion as spherical joint motion. Utilizing a novel parametrization and calibration scheme, we demonstrate that the pose estimation problem can be reduced to a 3-degree-of-freedom (DoF) search problem, instead of a generic 6-DoF problem. This facilitates the derivation of an efficient branch-and-bound optimization method that guarantees a globally optimal solution, even in the presence of outliers. Furthermore, as a simplified case of spherical joint motion, we introduce selfie motion, which has fewer DoF than spherical joint motion. We validate the performance and guaranteed optimality of our method on both synthetic and real-world data. Additionally, we demonstrate the applicability of the proposed method in two applications: refocusing and stylization.

8.
Sensors (Basel) ; 21(20)2021 Oct 13.
Article in English | MEDLINE | ID: mdl-34696018

ABSTRACT

With the emerging interest in autonomous vehicles (AVs), the performance and reliability of land-vehicle navigation have become increasingly important. In recent decades, passenger-car navigation systems have relied heavily on the Global Navigation Satellite System (GNSS). However, satellite signals are often degraded or blocked in real-world driving, for example on urban streets flanked by buildings, in tunnels, or in underpasses. In this paper, we propose a novel vehicle dead-reckoning method based on a lane detection model for GNSS-denied situations. The proposed method fuses an inertial navigation system (INS) with a learning-based lane detection model to estimate the global position of the vehicle, effectively bounding the error drift compared to standalone INS. The INS and the lane model are integrated by an unscented Kalman filter (UKF) to minimize linearization errors and computing time. The proposed method is evaluated through real-vehicle experiments on highway driving, and compared against other dead-reckoning algorithms with the same system configuration.
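A minimal sketch of such an INS/lane-fix fusion using the filterpy library, assuming a four-state model [east, north, heading, speed] and a lane-matching measurement that observes position; the state vector, noise values, and measurement model are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

def fx(x, dt):
    """Propagation of an assumed state [east, north, heading, speed]."""
    e, n, psi, v = x
    return np.array([e + v * dt * np.sin(psi), n + v * dt * np.cos(psi), psi, v])

def hx(x):
    """Assumed measurement: a position fix implied by matching the
    detected lane against a lane-level map."""
    return x[:2]

points = MerweScaledSigmaPoints(n=4, alpha=1e-3, beta=2.0, kappa=0.0)
ukf = UnscentedKalmanFilter(dim_x=4, dim_z=2, dt=0.05, fx=fx, hx=hx, points=points)
ukf.x = np.array([0.0, 0.0, 0.0, 20.0])  # initial state (assumed)
ukf.P *= 10.0
ukf.R = np.diag([0.5, 0.5]) ** 2         # lane-fix noise (assumed)
ukf.Q = np.eye(4) * 1e-3                 # process noise (assumed)

# Per time step: propagate with the motion model, correct with the lane fix.
# ukf.predict(); ukf.update(z_lane)
```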


Subjects
Automobile Driving, Geographic Information Systems, Algorithms, Reproducibility of Results
9.
IEEE Trans Image Process ; 30: 9150-9163, 2021.
Article in English | MEDLINE | ID: mdl-34554914

ABSTRACT

A common problem in human-object interaction (HOI) detection is that numerous HOI classes have only a small number of labeled examples, resulting in training sets with a long-tailed distribution. The lack of positive labels can lead to low classification accuracy for these classes. Toward addressing this issue, we observe that there exist natural correlations and anti-correlations among human-object interactions. In this paper, we model these correlations as action co-occurrence matrices and present techniques to learn these priors and leverage them for more effective training, especially on rare classes. The efficacy of our approach is demonstrated experimentally: its performance consistently improves over state-of-the-art methods on the two leading HOI detection benchmark datasets, HICO-Det and V-COCO.
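A small sketch, under assumed inputs, of how an action co-occurrence prior can be estimated from per-example action label sets; the row-normalized matrix approximates P(action j | action i). How the paper plugs the prior into training is not reproduced here.

```python
import numpy as np

def action_cooccurrence(label_sets, num_actions, eps=1e-8):
    """Estimate P(action j | action i) from per-example action label sets;
    such a prior can reweight or regularize rare-class training."""
    C = np.zeros((num_actions, num_actions))
    for labels in label_sets:
        for i in labels:
            for j in labels:
                C[i, j] += 1.0
    return C / (C.diagonal()[:, None] + eps)   # row-normalize by counts

# Example: three annotated human-object pairs with action label sets.
# prior = action_cooccurrence([{0, 2}, {0, 1, 2}, {1}], num_actions=3)
```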


Subjects
Algorithms, Humans
10.
IEEE Trans Pattern Anal Mach Intell ; 43(5): 1605-1619, 2021 May.
Article in English | MEDLINE | ID: mdl-31722472

ABSTRACT

Visual events are usually accompanied by sounds in our daily lives. But can machines learn to correlate a visual scene with its sound, and localize the sound source, merely by observing them as humans do? To investigate this empirical learnability, we first present a novel unsupervised algorithm for localizing sound sources in visual scenes. To achieve this goal, a two-stream network structure that handles each modality with an attention mechanism is developed for sound source localization. The network naturally reveals the localized response in the scene without human annotation. In addition, a new sound source dataset is developed for performance evaluation. Nevertheless, our empirical evaluation shows that the unsupervised method generates false conclusions in some cases. We show that such false conclusions cannot be fixed without human prior knowledge, owing to the well-known mismatch between correlation and causality. To fix this issue, we extend our network to supervised and semi-supervised settings via a simple modification, made possible by the general architecture of our two-stream network. We show that the false conclusions can be effectively corrected even with a small amount of supervision, i.e., in the semi-supervised setup. Furthermore, we demonstrate the versatility of the learned audio and visual embeddings for cross-modal content alignment, and we extend the proposed algorithm to a new application: sound-saliency-based automatic camera view panning in 360-degree videos.
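A hedged sketch of the attention idea: correlate an audio embedding with every spatial location of a visual feature map to obtain a localization map. Tensor shapes and the cosine-similarity choice are assumptions for illustration, not the paper's exact attention mechanism.

```python
import torch
import torch.nn.functional as F

def sound_localization_map(vis_feat, aud_emb):
    """Cosine-similarity attention between an audio embedding (B, C) and
    every spatial position of a visual feature map (B, C, H, W) -> (B, H, W)."""
    v = F.normalize(vis_feat, dim=1)
    a = F.normalize(aud_emb, dim=1)
    att = torch.einsum("bchw,bc->bhw", v, a)
    return att.relu()  # localized response map
```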

11.
IEEE Trans Pattern Anal Mach Intell ; 43(4): 1225-1238, 2021 Apr.
Article in English | MEDLINE | ID: mdl-31613749

ABSTRACT

We propose a novel approach to infer a high-quality depth map from a set of images with small viewpoint variations. In general, techniques for depth estimation from small motion consist of camera pose estimation and dense reconstruction. In contrast to prior approaches that recover scene geometry and camera motions using pre-calibrated cameras, we introduce in this paper a self-calibrating bundle adjustment method tailored for small motion which enables computation of camera poses without the need for camera calibration. For dense depth reconstruction, we present a convolutional neural network called DPSNet (Deep Plane Sweep Network) whose design is inspired by best practices of traditional geometry-based approaches. Rather than directly estimating depth or optical flow correspondence from image pairs as done in many previous deep learning methods, DPSNet takes a plane sweep approach that involves building a cost volume from deep features using the plane sweep algorithm, regularizing the cost volume, and regressing the depth map from the cost volume. The cost volume is constructed using a differentiable warping process that allows for end-to-end training of the network. Through the effective incorporation of conventional multiview stereo concepts within a deep learning framework, the proposed method achieves state-of-the-art results on a variety of challenging datasets.
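The plane-sweep construction can be sketched as follows in PyTorch, assuming known intrinsics K (3 x 3) and relative pose R, t (3 x 3 and 3 x 1) between reference and source views. DPSNet's actual network builds and regularizes a feature-concatenation volume rather than this simple L1 cost, so treat this purely as an illustration of the differentiable warping step.

```python
import torch
import torch.nn.functional as F

def plane_sweep_volume(ref_feat, src_feat, K, R, t, depths):
    """Build a (B, D, H, W) cost volume by warping source features onto
    fronto-parallel planes of the reference camera."""
    B, C, H, W = ref_feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(3, -1)
    K_inv = torch.inverse(K)
    costs = []
    for d in depths:
        # Back-project reference pixels to depth d, reproject into source.
        cam = K @ (R @ (K_inv @ pix * d) + t)              # 3 x H*W
        uv = cam[:2] / cam[2:].clamp(min=1e-6)
        grid = uv.t().reshape(1, H, W, 2).expand(B, -1, -1, -1).clone()
        grid[..., 0] = grid[..., 0] / (W - 1) * 2 - 1      # to [-1, 1]
        grid[..., 1] = grid[..., 1] / (H - 1) * 2 - 1
        warped = F.grid_sample(src_feat, grid, align_corners=True)
        costs.append((ref_feat - warped).abs().mean(dim=1))  # L1 cost
    return torch.stack(costs, dim=1)
```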

12.
Sci Rep ; 10(1): 11833, 2020 Jul 16.
Article in English | MEDLINE | ID: mdl-32678265

ABSTRACT

Magnetic particle imaging (MPI) is a technology that images the concentration of superparamagnetic iron oxide nanoparticles (SPIONs), which can be used as non-radioactive tracers in biomedical diagnostics and therapeutics. We propose a point-of-care testing MPI system (PoCT-MPI) for preclinical imaging of small rodents (mice) injected with SPIONs, not only in laboratories but also at emergency sites far from any laboratory. In particular, we apply a frequency-mixing magnetic detection method to the PoCT-MPI, and propose a hybrid field-free-line generator to reduce the system's power consumption, size, and weight. The PoCT-MPI is [Formula: see text] in size and weighs less than 100 kg. It can image a three-dimensional distribution of SPIONs injected into a biosample with less than 120 Wh of power consumption. Its detection limit is [Formula: see text], 10 mg/mL, [Formula: see text] (Fe).


Subjects
Brain/diagnostic imaging, Image Processing, Computer-Assisted/statistics & numerical data, Imaging, Three-Dimensional/methods, Magnetic Iron Oxide Nanoparticles/administration & dosage, Point-of-Care Testing, Animals, Humans, Imaging, Three-Dimensional/instrumentation, Limit of Detection, Magnetic Iron Oxide Nanoparticles/chemistry, Magnetic Phenomena, Male, Mice, Mice, Inbred C57BL, Rats, Rats, Sprague-Dawley
13.
IEEE Trans Pattern Anal Mach Intell ; 42(1): 232-245, 2020 Jan.
Article in English | MEDLINE | ID: mdl-30281438

ABSTRACT

While conventional calibrated photometric stereo methods assume that light intensities and sensor exposures are known, or unknown but identical across observed images, this assumption easily breaks down in practical settings due to individual light bulbs' characteristics and limited control over sensors. This paper studies the effect of unknown and possibly non-uniform light intensities and sensor exposures among observed images on shape recovery based on photometric stereo. This leads to the development of a "semi-calibrated" photometric stereo method, where the light directions are known but the light intensities (and sensor exposures) are unknown. We show that semi-calibrated photometric stereo becomes a bilinear problem, whose general form is difficult to solve, but that in the photometric stereo context there exists a unique solution for the surface normals and light intensities (or sensor exposures). We further show that there exists a linear solution method for the problem, and we develop efficient and stable solution methods. Semi-calibrated photometric stereo is advantageous over conventional calibrated photometric stereo for the accurate determination of surface normals, because it relaxes the assumption of known light intensity ratios/sensor exposures. The experimental results show the superior accuracy of semi-calibrated photometric stereo in comparison to conventional methods in practical settings.
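To make the bilinear structure concrete, here is a NumPy sketch that factorizes the measurements M (images x pixels) as diag(e) L B by alternating least squares, with known light directions L and unknown per-image intensities e. The paper derives a dedicated linear solver, so this alternation is only an assumed stand-in illustrating the unknowns.

```python
import numpy as np

def semi_calibrated_ps(M, L, iters=50):
    """Alternately solve M = diag(e) @ L @ B for scaled normals B (3 x p)
    and per-image intensities e (f,), given light directions L (f x 3)."""
    f, p = M.shape
    e = np.ones(f)
    for _ in range(iters):
        B, *_ = np.linalg.lstsq(np.diag(e) @ L, M, rcond=None)  # fix e
        LB = L @ B
        e = (LB * M).sum(axis=1) / (LB * LB).sum(axis=1)        # fix B
        e /= e.mean()                    # remove the global scale ambiguity
    normals = B / (np.linalg.norm(B, axis=0, keepdims=True) + 1e-12)
    return normals, e
```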

14.
IEEE Trans Pattern Anal Mach Intell ; 42(10): 2656-2669, 2020 Oct.
Article in English | MEDLINE | ID: mdl-30969915

ABSTRACT

In this work, we describe man-made structures via an appropriate structure assumption, called the Atlanta world assumption, which contains a vertical direction (typically the gravity direction) and a set of horizontal directions orthogonal to the vertical direction. Contrary to the commonly used Manhattan world assumption, the horizontal directions in an Atlanta world are not necessarily orthogonal to each other. While the Atlanta world can encompass a wider range of scenes, this makes the search space much larger and the problem more challenging. Our input data is a set of surface normals, for example acquired from RGB-D cameras or 3D laser scanners, as well as lines from calibrated images. Given this input data, we propose the first globally optimal method of inlier set maximization for Atlanta direction estimation. We define a novel search space for the Atlanta world, as well as its parametrization, and solve this challenging problem using a branch-and-bound (BnB) framework. To alleviate the computational bottleneck in BnB, i.e., the bound computation, we present two bound computation strategies: a rectangular bound and a slice bound in an efficient measurement domain, the extended Gaussian image (EGI). In addition, we propose an efficient two-stage method that automatically estimates the number of horizontal directions of a scene. Experimental results on synthetic and real-world datasets confirm the validity of our approach.
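As a toy analogue of BnB inlier-set maximization, the 1-D sketch below finds the angle that aligns the most directions within a threshold, using interval upper/lower bounds that guarantee the optimum. The paper's search runs over rotation space with EGI-based rectangular and slice bounds, so everything here (parametrization, bound rule) is a simplified assumption.

```python
import heapq
import numpy as np

def bnb_max_inliers(dirs, thresh, eps=1e-3):
    """Find the angle (radians) matching the most direction angles in
    `dirs` within `thresh`, by branch-and-bound over angular intervals."""
    def inliers(center, half):
        d = np.abs((dirs - center + np.pi) % (2 * np.pi) - np.pi)
        # (upper bound over the interval, lower bound at its center)
        return int((d <= thresh + half).sum()), int((d <= thresh).sum())

    heap = [(-len(dirs), 0.0, np.pi)]      # (-upper bound, center, half-width)
    best_n, best_angle = -1, 0.0
    while heap:
        neg_ub, c, h = heapq.heappop(heap)
        if -neg_ub <= best_n:
            break                          # no remaining interval can win
        ub, lb = inliers(c, h)
        if lb > best_n:
            best_n, best_angle = lb, c
        if h > eps and ub > best_n:        # split the interval in half
            for nc in (c - h / 2, c + h / 2):
                nub, _ = inliers(nc, h / 2)
                heapq.heappush(heap, (-nub, nc, h / 2))
    return best_angle, best_n
```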

15.
IEEE Trans Pattern Anal Mach Intell ; 42(5): 1038-1052, 2020 May.
Article in English | MEDLINE | ID: mdl-31831407

ABSTRACT

Video inpainting aims to fill in spatio-temporal holes in videos with plausible content. Despite tremendous progress in deep learning-based inpainting of single images, it is still challenging to extend these methods to the video domain because of the additional time dimension. In this paper, we propose a recurrent temporal aggregation framework for fast deep video inpainting. In particular, we construct an encoder-decoder model, where the encoder takes multiple reference frames that can provide visible pixels revealed by the scene dynamics. These hints are aggregated and fed into the decoder. We apply recurrent feedback in an auto-regressive manner to enforce temporal consistency in the video results. We propose two architectural designs based on this framework. Our first model is a blind video decaptioning network (BVDNet) designed to automatically remove and inpaint text overlays in videos without any mask information. Our BVDNet won first place in the ECCV ChaLearn 2018 LAP Inpainting Competition Track 2: Video Decaptioning. Second, we propose a network for more general video inpainting (VINet) to deal with more arbitrary and larger holes. Video results demonstrate the advantage of our framework over state-of-the-art methods both qualitatively and quantitatively. The codes are available at https://github.com/mcahny/Deep-Video-Inpainting and https://github.com/shwoo93/video_decaptioning.
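A deliberately tiny sketch of the recurrent feedback idea: each output frame is fed back as an input to the next step, which is what enforces temporal consistency. The single-convolution encoder/decoder and the fusion rule are placeholders, nothing like the paper's actual architecture.

```python
import torch
import torch.nn as nn

class RecurrentAggregator(nn.Module):
    """Toy recurrent temporal aggregation: encode the current frame and the
    previously generated frame, fuse, decode, and feed the output back."""
    def __init__(self, c=32):
        super().__init__()
        self.enc = nn.Conv2d(3, c, 3, padding=1)
        self.fuse = nn.Conv2d(2 * c, c, 3, padding=1)
        self.dec = nn.Conv2d(c, 3, 3, padding=1)

    def forward(self, frames):              # list of (B, 3, H, W) tensors
        prev = frames[0]                    # bootstrap with the first frame
        outs = []
        for f in frames:
            h = self.fuse(torch.cat([self.enc(f), self.enc(prev)], dim=1))
            prev = torch.tanh(self.dec(torch.relu(h)))
            outs.append(prev)
        return outs
```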

16.
Article in English | MEDLINE | ID: mdl-31478856

ABSTRACT

Depth from focus (DfF) is a method of estimating the depth of a scene using information acquired through changes in the focus of a camera. Within the DfF framework, the focus measure (FM) forms the foundation that determines the accuracy of the output. Given the FM results, the role of a DfF pipeline is to identify and recompute unreliable measurements while enhancing those that are reliable. In this paper, we propose a new FM, which we call the "ring difference filter" (RDF), that can measure focus more accurately and robustly. FMs can usually be categorized as confident local methods or noise-robust non-local methods. The RDF's unique ring-and-disk structure allows it to have the advantages of both local and non-local FMs. We then describe an efficient pipeline that utilizes the RDF's properties. Part of this pipeline is our proposed RDF-based cost aggregation method, which is able to robustly refine the initial results in the presence of image noise. Our method reproduces results that are on par with or better than those of state-of-the-art methods, while requiring less computation time.
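The ring-and-disk idea can be sketched as a zero-sum convolution kernel that averages an inner disk and subtracts the average over the surrounding ring; the absolute response is large where a pixel's neighbourhood is sharp. The radii and normalization below are assumed parameters, not the paper's.

```python
import numpy as np
from scipy.ndimage import convolve

def rdf_kernel(r_in=2, r_out=4):
    """Zero-sum ring-difference kernel: positive inner disk, negative ring."""
    y, x = np.ogrid[-r_out:r_out + 1, -r_out:r_out + 1]
    dist = np.sqrt(x * x + y * y)
    disk = (dist <= r_in).astype(float)
    ring = ((dist > r_in) & (dist <= r_out)).astype(float)
    return disk / disk.sum() - ring / ring.sum()

def rdf_focus_measure(img, **kw):
    """Per-pixel focus response: in-focus pixels differ sharply from their
    surrounding ring, giving large absolute filter responses."""
    return np.abs(convolve(img.astype(float), rdf_kernel(**kw)))
```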

17.
IEEE Trans Pattern Anal Mach Intell ; 41(3): 682-696, 2019 Mar.
Article in English | MEDLINE | ID: mdl-29993475

ABSTRACT

Most man-made environments, such as urban and indoor scenes, consist of a set of parallel and orthogonal planar structures. These structures are approximated by the Manhattan world assumption, a notion that can be represented as a Manhattan frame (MF). Given a set of inputs such as surface normals or vanishing points, we pose MF estimation as a consensus set maximization problem that maximizes the number of inliers over the rotation search space. Conventionally, this problem can be solved by a branch-and-bound framework, which mathematically guarantees global optimality. However, the computational time of conventional branch-and-bound algorithms is far from real-time. In this paper, we propose a novel bound computation method on an efficient measurement domain for MF estimation, i.e., the extended Gaussian image (EGI). By relaxing the original problem, we can compute the bound with constant complexity while preserving global optimality. Furthermore, we quantitatively and qualitatively demonstrate the performance of the proposed method on various synthetic and real-world data. We also show the versatility of our approach through three different applications: extension to multiple MF estimation, 3D rotation-based video stabilization, and vanishing point estimation (line clustering).

18.
IEEE Trans Pattern Anal Mach Intell ; 41(4): 775-787, 2019 Apr.
Article in English | MEDLINE | ID: mdl-29993773

ABSTRACT

Structure from small motion has become an important topic in 3D computer vision as a method for estimating depth, since the input is so easy to capture. However, it faces major limitations in the form of depth uncertainty, due to the narrow baseline and the rolling shutter effect. In this paper, we present a dense 3D reconstruction method for small motion clips captured with commercial hand-held cameras, which typically exhibit the undesired rolling-shutter artifact. To address these problems, we introduce a novel small motion bundle adjustment that effectively compensates for the rolling shutter effect. Moreover, we propose a pipeline for fine-scale dense 3D reconstruction that models the rolling shutter effect by utilizing both sparse 3D points and the camera trajectory from narrow-baseline images. In this reconstruction, the sparse 3D points are propagated to obtain an initial depth hypothesis using a geometry guidance term. Then, the depth of each pixel is obtained by sweeping a plane through the depth search space near the hypothesis. The proposed framework produces accurate dense reconstruction results suitable for various applications. Both qualitative and quantitative evaluations show that our method consistently generates better depth maps than state-of-the-art methods.

19.
IEEE Trans Pattern Anal Mach Intell ; 41(2): 297-310, 2019 Feb.
Article in English | MEDLINE | ID: mdl-29994179

ABSTRACT

One of the core applications of light field imaging is depth estimation. To acquire a depth map, existing approaches apply a single photo-consistency measure to an entire light field. However, this is not an optimal choice because of the non-uniform light field degradations produced by limitations in the hardware design. In this paper, we introduce a pipeline that automatically determines the best configuration of the photo-consistency measure, which leads to the most reliable depth label from the light field. We analyze the practical factors affecting degradation in lenslet light field cameras, and design a learning-based framework that can retrieve the best cost measure and optimal depth label. To enhance the reliability of our method, we augment an existing light field benchmark to simulate realistic source-dependent noise, aberrations, and vignetting artifacts. The augmented dataset is used for training and validation of the proposed approach. Our method is competitive with several state-of-the-art methods on the benchmark and on real-world light field datasets.

20.
IEEE Trans Image Process ; 28(3): 1054-1067, 2019 Mar.
Article in English | MEDLINE | ID: mdl-30281457

ABSTRACT

We propose a deep convolutional neural network (CNN) method for natural image matting. Our method takes multiple initial alpha mattes from previous methods and normalized RGB color images as inputs, and directly learns an end-to-end mapping between the inputs and reconstructed alpha mattes. Among the various existing methods, we focus on two simple methods for the initial alpha mattes: closed-form matting and KNN matting, which are complementary to each other in terms of local and nonlocal principles. A major benefit of our method is that it can "recognize" different local image structures and then combine the results of local (closed-form) and nonlocal (KNN) matting effectively to achieve higher-quality alpha mattes than either input. Furthermore, we verify the extendability of the proposed network to different combinations of initial alpha mattes from more advanced techniques, such as KL-divergence matting and information-flow matting. On top of the deep CNN matting, we build an RGB-guided JPEG artifacts removal network to handle JPEG block artifacts in alpha matting. Extensive experiments demonstrate that our proposed deep CNN matting produces visually and quantitatively high-quality alpha mattes. We perform further experiments, including studies evaluating the importance of balancing the training data, measuring the effects of the initial alpha mattes, and analyzing variant versions of the proposed network. In addition, our method achieved a high ranking on the public alpha matting evaluation dataset in terms of sum of absolute differences, mean squared error, and gradient error. Our RGB-guided JPEG artifacts removal network also restores alpha mattes damaged by JPEG compression.
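As a shape-level illustration only, a toy network that consumes an RGB image plus the two initial mattes as a 5-channel input and regresses a refined matte; the 9-1-5 layer pattern is an assumption in the spirit of early CNN regressors, not the paper's published architecture.

```python
import torch
import torch.nn as nn

class MatteCombiner(nn.Module):
    """Tiny stand-in for the deep CNN matting idea: fuse an RGB image with
    two initial alpha mattes (e.g., closed-form and KNN) into one alpha."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(5, 64, 9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 1),           nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 5, padding=2), nn.Sigmoid(),
        )

    def forward(self, rgb, alpha_cf, alpha_knn):
        x = torch.cat([rgb, alpha_cf, alpha_knn], dim=1)  # B, 5, H, W
        return self.net(x)
```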
