Results 1 - 16 of 16
1.
IEEE Trans Image Process ; 33: 2689-2702, 2024.
Article in English | MEDLINE | ID: mdl-38536682

ABSTRACT

Unsupervised object discovery (UOD) refers to the task of discriminating the whole region of objects from the background within a scene without relying on labeled datasets, which benefits the tasks of bounding-box-level localization and pixel-level segmentation. This task is promising due to its ability to discover objects in a generic manner. We roughly categorize existing techniques into two main directions, namely generative solutions based on image resynthesis and clustering methods based on self-supervised models. We have observed that the former heavily relies on the quality of image reconstruction, while the latter shows limitations in effectively modeling semantic correlations. To directly target object discovery, we focus on the latter approach and propose a novel solution that incorporates weakly-supervised contrastive learning (WCL) to enhance semantic information exploration. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images, which is achieved by fine-tuning the feature encoder of a self-supervised model, namely DINO, via WCL. Subsequently, we introduce Principal Component Analysis (PCA) to localize object regions. The principal projection direction, corresponding to the maximal eigenvalue, serves as an indicator of the object region(s). Extensive experiments on benchmark unsupervised object discovery datasets demonstrate the effectiveness of our proposed solution. The source code and experimental results are publicly available via our project page at https://github.com/npucvr/WSCUOD.git.
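
The PCA-based localization step can be illustrated with a small, generic sketch (not the authors' released code; the feature extractor, grid size, and thresholding rule below are stand-in assumptions): patch features are projected onto the eigenvector with the largest eigenvalue, and the projections are thresholded into a coarse object mask.

```python
# Minimal sketch: coarse object localization by projecting patch features onto
# the first principal component. Any ViT-style encoder returning one vector per
# patch could supply `patch_feats`; random features are used here as stand-ins.
import numpy as np

def pca_object_mask(patch_feats, grid_hw):
    """patch_feats: (N, D) patch embeddings; grid_hw: (H, W) patch grid."""
    h, w = grid_hw
    feats = patch_feats - patch_feats.mean(axis=0, keepdims=True)
    # Eigen-decomposition of the feature covariance; the eigenvector with the
    # largest eigenvalue gives the principal projection direction.
    cov = feats.T @ feats / feats.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    principal = eigvecs[:, -1]                      # direction of maximal variance
    scores = feats @ principal                      # per-patch projection
    mask = (scores > scores.mean()).reshape(h, w)   # simple thresholding heuristic
    if mask.mean() > 0.5:                           # flip if background dominates
        mask = ~mask
    return mask

# Usage with random stand-in features for a 14x14 patch grid:
mask = pca_object_mask(np.random.randn(196, 768), (14, 14))
```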

2.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 3862-3879, 2024 May.
Article in English | MEDLINE | ID: mdl-38190689

ABSTRACT

Rolling shutter temporal super-resolution (RSSR), which aims to synthesize intermediate global shutter (GS) video frames between two consecutive rolling shutter (RS) frames, has made remarkable progress with the development of deep convolutional neural networks in recent years. Existing methods cascade multiple separate networks to sequentially estimate intermediate motion fields and synthesize the target GS frames. Nevertheless, they are typically complex, do not facilitate the interaction of complementary motion and appearance information, and suffer from problems such as pixel aliasing or poor interpretability. In this paper, we derive uniform bilateral motion fields for RS-aware backward warping, which endows our network with a more explicit geometric meaning by injecting spatio-temporal consistency information through time-offset embedding. More importantly, we develop a unified, single-stage RSSR pipeline to recover the latent GS video in a coarse-to-fine manner. It first extracts pyramid features from the given inputs, and then refines the bilateral motion fields together with the anchor frame until the desired output is generated. With the help of our proposed bilateral cost volume, which uses the anchor frame as a common reference to model the correlation with the two RS frames, the gradually refined anchor frames not only facilitate intermediate motion estimation, but also compensate for contextual details, making additional frame synthesis or refinement networks unnecessary. Meanwhile, an asymmetric bilateral motion model built on top of the symmetric bilateral motion model further improves generality and adaptability, yielding better GS video reconstruction performance. Extensive quantitative and qualitative experiments on synthetic and real data demonstrate that our method achieves new state-of-the-art results.

3.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 3137-3155, 2024 May.
Article in English | MEDLINE | ID: mdl-38090832

ABSTRACT

Cross-domain generalizable depth estimation aims to estimate the depth of target domains (i.e., real-world) using models trained on source domains (i.e., synthetic). Previous methods mainly use additional real-world domain datasets to extract depth-specific information for cross-domain generalizable depth estimation. Unfortunately, due to the large domain gap, adequate depth-specific information is hard to obtain and interference is difficult to remove, which limits performance. To alleviate these problems, we propose a domain-generalizable feature extraction network with adaptive guidance fusion (AGDF-Net) to fully acquire essential features for depth estimation at multiple feature scales. Specifically, our AGDF-Net first separates the image into initial depth components and weakly depth-related components using reconstruction and contrary losses. Subsequently, an adaptive guidance fusion module is designed to sufficiently intensify the initial depth features and acquire domain-generalizable intensified depth features. Finally, taking the intensified depth features as input, an arbitrary depth estimation network can be used for real-world depth estimation. Using only synthetic datasets, our AGDF-Net can be applied to various real-world datasets (i.e., KITTI, NYUDv2, NuScenes, DrivingStereo and CityScapes) with state-of-the-art performance. Furthermore, experiments with a small amount of real-world data in a semi-supervised setting also demonstrate the superiority of AGDF-Net over state-of-the-art approaches.

4.
IEEE Trans Pattern Anal Mach Intell ; 45(12): 14301-14320, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37590113

ABSTRACT

Due to domain differences and the unbalanced disparity distribution across multiple datasets, current stereo matching approaches are commonly limited to a specific dataset and generalize poorly to others. This domain shift issue is usually addressed by substantial adaptation using costly target-domain ground-truth data, which cannot be easily obtained in practical settings. In this paper, we propose to dig into uncertainty estimation for robust stereo matching. Specifically, to balance the disparity distribution, we employ pixel-level uncertainty estimation to adaptively adjust the disparity search space of the next stage, thereby driving the network to progressively prune out the space of unlikely correspondences. Then, to address the scarcity of ground-truth data, an uncertainty-based pseudo-labeling scheme is proposed to adapt the pre-trained model to the new domain, where pixel-level and area-level uncertainty estimation are used to filter out high-uncertainty pixels of the predicted disparity maps and generate sparse yet reliable pseudo-labels to bridge the domain gap. Experimentally, our method shows strong cross-domain, adaptation, and joint generalization, and obtained 1st place on the stereo task of the Robust Vision Challenge 2020. Additionally, our uncertainty-based pseudo-labels can be extended to train monocular depth estimation networks in an unsupervised way and even achieve performance comparable to supervised methods.
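
A minimal sketch of the pseudo-labeling idea, under assumed thresholds rather than the paper's actual criteria: pixels whose predicted uncertainty is low are kept as sparse but reliable disparity labels for adapting to the target domain, and maps with too few reliable pixels are discarded.

```python
# Illustrative sketch (not the paper's implementation): keep only low-uncertainty
# pixels of a predicted disparity map as sparse pseudo-labels for domain adaptation.
import numpy as np

def sparse_pseudo_labels(disparity, uncertainty, pixel_thresh=0.2, area_frac=0.5):
    """disparity, uncertainty: (H, W) arrays; returns disparity with unreliable pixels set to NaN."""
    valid = uncertainty < pixel_thresh            # pixel-level filtering
    # Area-level check (assumed rule): discard the whole map if too few pixels are reliable.
    if valid.mean() < area_frac:
        valid[:] = False
    return np.where(valid, disparity, np.nan)     # sparse but reliable labels

pseudo = sparse_pseudo_labels(np.random.rand(256, 512) * 192,
                              np.random.rand(256, 512))
```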

5.
Article in English | MEDLINE | ID: mdl-37022903

ABSTRACT

Single image dehazing is a challenging and ill-posed problem due to the severe information degradation of images captured in hazy conditions. Remarkable progress has been achieved by deep-learning-based image dehazing methods, where residual learning is commonly used to separate the hazy image into clear and haze components. However, the inherently low similarity between the haze and clear components is commonly neglected, and the lack of a constraint on the contrastive peculiarity between the two components restricts the performance of these approaches. To deal with these problems, we propose an end-to-end self-regularized network (TUSR-Net) which exploits the contrastive peculiarity of different components of the hazy image, i.e., self-regularization (SR). Specifically, the hazy image is separated into clear and haze components, and the constraint between the image components, i.e., self-regularization, is leveraged to pull the recovered clear image closer to the ground truth, which largely improves the performance of image dehazing. Meanwhile, an effective triple unfolding framework combined with dual feature-to-pixel attention is proposed to intensify and fuse the intermediate information at the feature, channel, and pixel levels, respectively, so that features with better representational ability can be obtained. Our TUSR-Net achieves a better trade-off between performance and parameter size through a weight-sharing strategy and is much more flexible. Experiments on various benchmark datasets demonstrate the superiority of our TUSR-Net over state-of-the-art single image dehazing methods.

6.
IEEE Trans Neural Netw Learn Syst ; 34(10): 7796-7809, 2023 Oct.
Article in English | MEDLINE | ID: mdl-35143404

ABSTRACT

In stereo matching, various learning-based approaches have shown impressive performance in solving traditional difficulties on multiple datasets. While most progress is obtained on a specific dataset with a dataset-specific network design, the effect of the training strategy on single-dataset and cross-dataset performance is often ignored. In this article, we analyze the relationship between different training strategies and performance by retraining some representative state-of-the-art methods (e.g., the geometry and context network (GC-Net), the pyramid stereo matching network (PSM-Net), and the guided aggregation network (GA-Net)). According to our research, it is surprising that the performance of networks on single or cross datasets is significantly improved by pre-training and data augmentation, without requiring any particular network structure. Based on this discovery, we improve our previous non-local context attention network (NLCA-Net) to NLCA-Net v2, train it with the new strategy, and concurrently rethink the training strategy for stereo matching. The quantitative experiments demonstrate that: 1) our model is capable of reaching top performance on both single and multiple datasets with the same parameters in this study, and won 2nd place in the stereo task of the ECCV Robust Vision Challenge 2020 (RVC 2020); and 2) on small datasets (e.g., KITTI, ETH3D, and Middlebury), the model's generalization and robustness are significantly affected by pre-training and data augmentation, even exceeding the influence of the network structure in some cases. These observations challenge the conventional wisdom about network architectures at this stage. We expect these discoveries to encourage researchers to rethink the current paradigm of "excessive attention on the performance of a single small dataset" in stereo matching.

7.
IEEE Trans Pattern Anal Mach Intell ; 45(5): 6214-6230, 2023 May.
Article in English | MEDLINE | ID: mdl-36269907

ABSTRACT

A single rolling-shutter (RS) image may be viewed as a row-wise combination of a sequence of global-shutter (GS) images captured by a (virtual) moving GS camera within the exposure duration. Although rolling-shutter cameras are widely used, the RS effect causes obvious image distortion, especially in the presence of fast camera motion, hindering downstream computer vision tasks. In this paper, we propose to invert the rolling-shutter image capture mechanism, i.e., to recover a continuous high-framerate global-shutter video from two time-consecutive RS frames. We call this task the RS temporal super-resolution (RSSR) problem. RSSR is a very challenging task, and to our knowledge, no practical solution exists to date. This paper presents a novel deep-learning-based solution. By leveraging the multi-view geometry of the RS imaging process, our learning-based framework successfully achieves high-framerate GS generation. Specifically, three novel contributions can be identified: (i) novel formulations for bidirectional RS undistortion flows under constant-velocity as well as constant-acceleration motion models; (ii) a simple linear scaling operation that bridges the RS undistortion flow and regular optical flow; and (iii) a new mutual conversion scheme between varying RS undistortion flows corresponding to different scanlines. Our method also exploits the underlying spatial-temporal geometric relationships within a deep learning framework, where no additional supervision is required beyond the necessary middle-scanline GS image. Building upon these contributions, this paper presents the very first rolling-shutter temporal super-resolution deep network that is able to recover high-framerate global-shutter videos from just two RS frames. Extensive experimental results on both synthetic and real data show that our proposed method can produce high-quality GS image sequences with rich details, outperforming state-of-the-art methods.

8.
IEEE Trans Image Process ; 31: 7237-7251, 2022.
Article in English | MEDLINE | ID: mdl-36374880

ABSTRACT

Event cameras such as DAVIS can simultaneously output high temporal resolution events and low frame-rate intensity images, which have great potential for capturing scene motion, e.g., for optical flow estimation. Most existing optical flow estimation methods are based on two consecutive image frames and can only estimate discrete flow at a fixed time interval. Previous work has shown that continuous flow estimation can be achieved by changing the quantities or time intervals of events. However, these methods struggle to estimate reliable dense flow, especially in regions without any triggered events. In this paper, we propose a novel deep-learning-based dense and continuous optical flow estimation framework from a single image with event streams, which facilitates the accurate perception of high-speed motion. Specifically, we first propose an event-image fusion and correlation module to effectively exploit the internal motion from two different modalities of data. Then we propose an iterative update network structure with bidirectional training for optical flow prediction. Therefore, our model can estimate reliable dense flow as two-frame-based methods do, as well as temporally continuous flow as event-based methods do. Extensive experimental results on both synthetic and real captured datasets demonstrate that our model outperforms existing event-based state-of-the-art methods and our designed baselines for accurate dense and continuous optical flow estimation.

9.
IEEE Trans Image Process ; 31: 1641-1656, 2022.
Article in English | MEDLINE | ID: mdl-35081028

ABSTRACT

In this paper, we propose a relative pose estimation algorithm for micro-lens array (MLA)-based conventional light field (LF) cameras. First, by employing matched LF-point pairs, we establish the LF-point-LF-point correspondence model to represent the correlation between LF features of the same 3D scene point in a pair of LFs. Then, we employ the proposed correspondence model to estimate the relative camera pose, which includes a linear solution and a non-linear optimization on a manifold. Unlike prior related algorithms, which estimate relative poses based on the recovered depths of scene points, we adopt the estimated disparities to avoid the inaccuracy in recovering depths caused by the ultra-small baseline between sub-aperture images of LF cameras. Experimental results on both simulated and real scene data demonstrate the effectiveness of the proposed algorithm compared with classical as well as state-of-the-art relative pose estimation algorithms.

10.
IEEE Trans Pattern Anal Mach Intell ; 44(5): 2519-2533, 2022 May.
Article in English | MEDLINE | ID: mdl-33166250

ABSTRACT

Event-based cameras measure intensity changes (called 'events') with microsecond accuracy under high-speed motion and challenging lighting conditions. With the 'active pixel sensor' (APS), the 'Dynamic and Active-pixel Vision Sensor' (DAVIS) allows the simultaneous output of intensity frames and events. However, the output images are captured at a relatively low frame rate and often suffer from motion blur. A blurred image can be regarded as the integral of a sequence of latent images, while events indicate changes between the latent images. Thus, we are able to model the blur-generation process by associating event data with a latent sharp image. Based on the abundant event data alongside the low-frame-rate, easily blurred images, we propose a simple yet effective approach to reconstruct high-quality, high-frame-rate sharp videos. Starting with a single blurred frame and its event data from DAVIS, we propose the Event-based Double Integral (EDI) model and solve it by adding regularization terms. Then, we extend it to the multiple Event-based Double Integral (mEDI) model to obtain smoother results based on multiple images and their events. Furthermore, we provide a new and more efficient solver to minimize the proposed energy model. By optimizing the energy function, we achieve significant improvements in removing blur and reconstructing a high-temporal-resolution video. The video generation is based on solving a simple non-convex optimization problem in a single scalar variable. Experimental results on both synthetic and real datasets demonstrate the superiority of our mEDI model and optimization method compared to the state-of-the-art.
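
A toy, discretized sketch of the double-integral relation described above (the contrast threshold c and the per-bin event representation are assumptions for illustration): the blurred frame is treated as the temporal average of latent frames, events give the log-intensity change between them, and a latent sharp frame is recovered by dividing the blur by the averaged exponentiated event integral.

```python
# Simplified discrete-time illustration of the event-based double integral idea.
import numpy as np

def edi_deblur(blurred, event_frames, c=0.2):
    """blurred: (H, W) image; event_frames: (T, H, W) signed event counts per time bin."""
    E = np.cumsum(event_frames, axis=0)            # integral of events up to each bin
    denom = np.exp(c * E).mean(axis=0)             # average of exp(c * E(t)) over the exposure
    return blurred / np.maximum(denom, 1e-6)       # latent frame at the reference time

sharp = edi_deblur(np.random.rand(64, 64),
                   np.random.randint(-1, 2, size=(30, 64, 64)))
```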

11.
IEEE Trans Pattern Anal Mach Intell ; 44(9): 5761-5779, 2022 Sep.
Article in English | MEDLINE | ID: mdl-33856982

ABSTRACT

We propose the first stochastic framework to employ uncertainty for RGB-D saliency detection by learning from the data labeling process. Existing RGB-D saliency detection models treat this task as a point estimation problem, predicting a single saliency map following a deterministic learning pipeline. We argue, however, that this deterministic solution is relatively ill-posed. Inspired by the saliency data labeling process, we propose a generative architecture for probabilistic RGB-D saliency detection, which utilizes a latent variable to model the labeling variations. Our framework includes two main models: 1) a generator model, which maps the input image and latent variable to a stochastic saliency prediction, and 2) an inference model, which gradually updates the latent variable by sampling it from the true or approximate posterior distribution. The generator model is an encoder-decoder saliency network. To infer the latent variable, we introduce two different solutions: i) a Conditional Variational Auto-encoder with an extra encoder to approximate the posterior distribution of the latent variable; and ii) an Alternating Back-Propagation technique, which directly samples the latent variable from the true posterior distribution. Qualitative and quantitative results on six challenging RGB-D benchmark datasets show our approach's superior performance in learning the distribution of saliency maps. The source code is publicly available via our project page: https://github.com/JingZhang617/UCNet.
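
A rough sketch of the stochastic-prediction idea, with network internals replaced by stand-ins: a latent variable is sampled repeatedly (here via the reparameterization trick) and decoded into multiple plausible saliency maps, whose spread reflects labeling uncertainty.

```python
# Hedged illustration of sampling-based stochastic saliency prediction; the decoder
# and the posterior parameters (mu, logvar) are placeholder assumptions.
import numpy as np

def stochastic_saliency(image_feat, decode, mu, logvar, n_samples=5):
    """image_feat: (D,) features; decode: callable(feat, z) -> (H, W) saliency map."""
    maps = []
    for _ in range(n_samples):
        z = mu + np.exp(0.5 * logvar) * np.random.randn(*mu.shape)  # reparameterized sample
        maps.append(decode(image_feat, z))
    maps = np.stack(maps)
    return maps.mean(axis=0), maps.std(axis=0)     # mean prediction and its pixel-wise spread

mean_map, uncert = stochastic_saliency(
    np.random.randn(128), lambda f, z: np.random.rand(32, 32),
    mu=np.zeros(16), logvar=np.zeros(16))
```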

12.
IEEE Trans Image Process ; 30: 4691-4705, 2021.
Article in English | MEDLINE | ID: mdl-33900917

ABSTRACT

The success of supervised learning-based single image depth estimation methods critically depends on the availability of large-scale dense per-pixel depth annotations, which requires a laborious and expensive annotation process. Therefore, self-supervised methods are highly desirable and have attracted significant attention recently. However, depth maps predicted by existing self-supervised methods tend to be blurry, with many depth details lost. To overcome these limitations, we propose a novel framework, named MLDA-Net, to obtain per-pixel depth maps with sharper boundaries and richer depth details. Our first innovation is a multi-level feature extraction (MLFE) strategy which can learn rich hierarchical representations. Then, a dual-attention strategy, combining global attention and structure attention, is proposed to intensify the obtained features both globally and locally, resulting in improved depth maps with sharper boundaries. Finally, a reweighted loss strategy based on multi-level outputs is proposed to conduct effective supervision for self-supervised depth estimation. Experimental results demonstrate that our MLDA-Net framework achieves state-of-the-art depth prediction results on the KITTI benchmark for self-supervised monocular depth estimation with different input modes and training modes. Extensive experiments on other benchmark datasets further confirm the superiority of our proposed approach.
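
The reweighted multi-level loss can be pictured with a small sketch; the exponential weighting rule below is an illustrative assumption rather than the paper's formulation: per-scale errors are combined with weights so that more reliable scales contribute more to the supervision signal.

```python
# Illustrative reweighted multi-scale loss under an assumed weighting rule.
import numpy as np

def reweighted_loss(per_scale_errors):
    """per_scale_errors: list of scalar photometric errors, one per output scale."""
    errs = np.array(per_scale_errors, dtype=float)
    weights = np.exp(-errs)            # lower error -> higher weight (assumed rule)
    weights /= weights.sum()
    return float((weights * errs).sum())

loss = reweighted_loss([0.12, 0.08, 0.15, 0.21])
```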

13.
IEEE Trans Pattern Anal Mach Intell ; 43(8): 2866-2873, 2021 Aug.
Article in English | MEDLINE | ID: mdl-33351750

ABSTRACT

The advances made in predicting visual saliency using deep neural networks come at the expense of collecting large-scale annotated data. However, pixel-wise annotation is labor-intensive and overwhelming. In this paper, we propose to learn saliency prediction from a single noisy labelling, which is easy to obtain (e.g., from imperfect human annotation or from unsupervised saliency prediction methods). With this goal, we address a natural question: can we learn saliency prediction while identifying clean labels in a unified framework? To answer this question, we call on the theory of robust model fitting, formulate deep saliency prediction from a single noisy labelling as robust network learning, and exploit model consistency across iterations to identify inliers and outliers (i.e., noisy labels). Extensive experiments on different benchmark datasets demonstrate the superiority of our proposed framework, which can learn saliency prediction comparable to state-of-the-art fully supervised saliency methods. Furthermore, we show that simply by treating ground-truth annotations as noisy labelling, our framework achieves tangible improvements over state-of-the-art methods.
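
A minimal sketch of the inlier-selection idea, with an assumed agreement rule rather than the paper's exact criterion: a pixel's noisy label is treated as clean when predictions at consecutive iterations are consistent with each other and with the label, and the loss is applied only on those pixels.

```python
# Illustrative inlier/outlier selection via prediction consistency across iterations.
import numpy as np

def inlier_mask(pred_prev, pred_curr, noisy_label, tol=0.15):
    """All inputs are (H, W) saliency maps in [0, 1]; returns a boolean mask of likely-clean pixels."""
    consistent = np.abs(pred_curr - pred_prev) < tol       # model consistency across iterations
    agrees = np.abs(pred_curr - noisy_label) < tol          # agreement with the noisy label
    return consistent & agrees

mask = inlier_mask(np.random.rand(64, 64), np.random.rand(64, 64),
                   (np.random.rand(64, 64) > 0.5).astype(float))
```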

14.
IEEE Trans Pattern Anal Mach Intell ; 43(5): 1705-1717, 2021 May.
Article in English | MEDLINE | ID: mdl-31765303

ABSTRACT

This work addresses the task of dense 3D reconstruction of a complex dynamic scene from images. The prevailing idea for solving this task is composed of a sequence of steps and depends on the success of several pipelines in its execution. To overcome such limitations of existing algorithms, we propose a unified approach to solve this problem. We assume that a dynamic scene can be approximated by numerous piecewise planar surfaces, where each planar surface enjoys its own rigid motion, and the global change in the scene between two frames is as-rigid-as-possible (ARAP). Consequently, our model of a dynamic scene reduces to a soup of planar structures and the rigid motion of these local planar structures. Using planar over-segmentation of the scene, we reduce this task to solving a "3D jigsaw puzzle" problem. Hence, the task boils down to correctly assembling each rigid piece to construct a 3D shape that complies with the geometry of the scene under the ARAP assumption. Further, we show that our approach provides an effective solution to the inherent scale ambiguity in structure-from-motion under perspective projection. We provide extensive experimental results and evaluation on several benchmark datasets. Quantitative comparison with competing approaches shows state-of-the-art performance.

15.
Article in English | MEDLINE | ID: mdl-31613765

ABSTRACT

Stereo videos of dynamic scenes often exhibit unpleasant blur due to camera motion and multiple moving objects with large depth variations. Given consecutive blurred stereo video frames, we aim to recover the latent clean images, estimate the 3D scene flow, and segment the multiple moving objects. These three tasks have previously been addressed separately, which fails to exploit the internal connections among them and cannot achieve optimality. In this paper, we propose to jointly solve these three tasks in a unified framework by exploiting their intrinsic connections. To this end, we represent dynamic scenes with a piecewise planar model, which exploits the local structure of the scene and can express a variety of dynamic scenes. Under our model, these three tasks are naturally connected and expressed as the parameter estimation of 3D scene structure and camera motion (structure and motion for dynamic scenes). By exploiting the blur model constraint, the moving objects, and the 3D scene structure, we arrive at an energy minimization formulation for joint deblurring, scene flow estimation, and segmentation. We evaluate our approach extensively on both synthetic datasets and publicly available real datasets with fast-moving objects, camera motion, uncontrolled lighting conditions, and shadows. Experimental results demonstrate that our method achieves significant improvements in stereo video deblurring, scene flow estimation, and moving object segmentation over state-of-the-art methods.

16.
IEEE Trans Pattern Anal Mach Intell ; 35(9): 2238-51, 2013 Sep.
Article in English | MEDLINE | ID: mdl-23868782

ABSTRACT

The Sturm-Triggs type iteration is a classic approach for solving the projective structure-from-motion (SfM) factorization problem, which iteratively solves for the projective depths, scene structure, and camera motions in an alternating fashion. Like many other iterative algorithms, the Sturm-Triggs iteration suffers from common drawbacks, such as requiring a good initialization and possibly failing to converge or converging only to a local minimum. In this paper, we formulate the projective SfM problem as a novel element-wise factorization (i.e., Hadamard factorization) problem, as opposed to the conventional matrix factorization. Thanks to this formulation, we are able to solve for the projective depths, structure, and camera motions simultaneously by convex optimization. To address the scalability issue, we adopt a continuation-based algorithm. Our method is a global method, in the sense that it is guaranteed to obtain a globally optimal solution up to the relaxation gap. Another advantage is that our method can handle challenging real-world situations, such as missing data and outliers, quite easily, all in a natural and unified manner. Extensive experiments on both synthetic and real images show results comparable with state-of-the-art methods.
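
The element-wise factorization model can be made concrete with a toy example on synthetic data (the paper's convex solver is not shown here): rescaling the homogeneous image measurements by the correct projective depths via a Hadamard product yields a matrix of rank at most four.

```python
# Toy illustration of the element-wise (Hadamard) factorization relation for projective SfM.
import numpy as np

m, n = 3, 10                                             # cameras, points
X = np.vstack([np.random.randn(3, n), np.ones((1, n))])  # 3D points (homogeneous)
P = np.random.randn(3 * m, 4)                             # stacked 3x4 camera matrices
proj = P @ X                                              # (3m, n) unnormalized projections
depths = proj[2::3, :]                                    # projective depths (lambda), one row per camera
lam = np.repeat(depths, 3, axis=0)                        # depth pattern aligned with proj's rows
W = proj / lam                                            # normalized homogeneous image measurements
rescaled = lam * W                                        # Hadamard product: lambda o W
print(np.linalg.matrix_rank(rescaled))                    # rank 4 when the depths are correct
```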
