Results 1 - 11 of 11
1.
IEEE Trans Image Process; 33: 4159-4172, 2024.
Article in English | MEDLINE | ID: mdl-38985554

ABSTRACT

2D-3D joint learning is essential and effective for fundamental 3D vision tasks, such as 3D semantic segmentation, due to the complementary information these two visual modalities contain. Most current 3D scene semantic segmentation methods process 2D images "as they are", i.e., only real captured 2D images are used. However, such captured 2D images may be redundant, with abundant occlusion and/or limited field of view (FoV), leading to poor performance for current methods involving 2D inputs. In this paper, we propose a general learning framework for joint 2D-3D scene understanding by selecting informative virtual 2D views of the underlying 3D scene. We then feed both the 3D geometry and the generated virtual 2D views into any joint 2D-3D-input or pure 3D-input based deep neural model to improve 3D scene understanding. Specifically, we generate virtual 2D views based on an information score map learned from the current 3D scene semantic segmentation results. To achieve this, we formalize the learning of the information score map as a deep reinforcement learning process, which rewards good predictions using a deep neural network. To obtain a compact set of virtual 2D views that jointly cover informative surfaces of the 3D scene as much as possible, we further propose an efficient greedy virtual view coverage strategy in the normal-sensitive 6D space, comprising 3D point coordinates and 3D normals. We have validated our proposed framework with various joint 2D-3D-input or pure 3D-input based deep neural models on two real-world 3D scene datasets, ScanNet v2 and S3DIS; the results demonstrate that our method obtains a consistent gain over baseline models and achieves new top accuracy for joint 2D and 3D scene semantic segmentation. Code is available at https://github.com/smy-THU/VirtualViewSelection.
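
For intuition, a minimal numpy sketch of the greedy coverage step described above: candidate views are scored by how much not-yet-covered, informative surface they add, and the view with the largest marginal gain is picked until the budget is exhausted. The data layout and scoring below are illustrative assumptions, not the released implementation.

import numpy as np

def greedy_view_selection(view_coverage, info_scores, max_views):
    """view_coverage: list of boolean arrays, one per candidate view, marking
    which 6D surface samples (coordinates + normals) that view observes.
    info_scores: per-sample information scores (e.g., from the learned map).
    Returns indices of the selected views."""
    covered = np.zeros_like(info_scores, dtype=bool)
    selected = []
    for _ in range(max_views):
        # Marginal gain: information score of samples not yet covered.
        gains = [info_scores[cov & ~covered].sum() for cov in view_coverage]
        best = int(np.argmax(gains))
        if gains[best] <= 0:          # nothing new would be covered; stop early
            break
        selected.append(best)
        covered |= view_coverage[best]
    return selected

# Toy usage: 3 candidate views over 5 surface samples.
coverage = [np.array([1, 1, 0, 0, 0], bool),
            np.array([0, 1, 1, 1, 0], bool),
            np.array([0, 0, 0, 1, 1], bool)]
scores = np.array([0.9, 0.2, 0.5, 0.7, 0.4])
print(greedy_view_selection(coverage, scores, max_views=2))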

2.
Article in English | MEDLINE | ID: mdl-39012751

ABSTRACT

Neural radiance fields (NeRF) have achieved great success in novel view synthesis and 3D representation for static scenarios. Existing dynamic NeRFs usually exploit a locally dense grid to fit the deformation fields; however, they fail to capture the global dynamics and concomitantly yield models with heavy parameter counts. We observe that the 4D space is inherently sparse. Firstly, the deformation fields are sparse in space but dense in time due to the continuity of motion. Secondly, the radiance fields are only valid on the surface of the underlying scene, usually occupying a small fraction of the whole space. We thus represent the 4D scene using a learnable sparse latent space, a.k.a. SLS4D. Specifically, SLS4D first uses dense learnable time-slot features to depict the temporal space, from which the deformation fields are fitted with linear multi-layer perceptrons (MLPs) to predict the displacement of a 3D position at any time. It then learns the spatial features of a 3D position using another sparse latent space. This is achieved by learning the adaptive weights of each latent feature with an attention mechanism. Extensive experiments demonstrate the effectiveness of our SLS4D: it achieves the best 4D novel view synthesis using only about 6% of the parameters of the most recent work.
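
A minimal PyTorch sketch of the deformation-field idea: learnable time-slot features are linearly blended at a query time t and fed, together with the 3D position, to an MLP that predicts the displacement. All sizes and names are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class TimeSlotDeformation(nn.Module):
    def __init__(self, num_slots=64, slot_dim=32, hidden=128):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, slot_dim))  # dense time slots
        self.mlp = nn.Sequential(
            nn.Linear(3 + slot_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))          # 3D displacement

    def forward(self, xyz, t):
        # xyz: (B, 3) points, t: (B,) times in [0, 1]
        pos = t * (self.slots.shape[0] - 1)                   # fractional slot index
        i0 = pos.floor().long().clamp(max=self.slots.shape[0] - 2)
        w = (pos - i0.float()).unsqueeze(-1)
        feat = (1 - w) * self.slots[i0] + w * self.slots[i0 + 1]  # linear blend of slots
        return self.mlp(torch.cat([xyz, feat], dim=-1))

model = TimeSlotDeformation()
dx = model(torch.rand(4, 3), torch.rand(4))
print(dx.shape)  # torch.Size([4, 3])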

3.
Article in English | MEDLINE | ID: mdl-38889040

ABSTRACT

High-fidelity online 3D scene reconstruction from monocular videos remains challenging, especially for coherent and fine-grained geometry reconstruction. Previous learning-based online 3D reconstruction approaches with neural implicit representations have shown a promising ability for coherent scene reconstruction, but often fail to consistently reconstruct fine-grained geometric details during online reconstruction. This paper presents a new on-the-fly monocular 3D reconstruction approach, named GP-Recon, to perform high-fidelity online neural 3D reconstruction with fine-grained geometric details. We incorporate a geometric prior (GP) into a scene's neural geometry learning to better capture its geometric details and, more importantly, propose an online volume rendering optimization to reconstruct and maintain geometric details during the online reconstruction task. Extensive comparisons with state-of-the-art approaches show that our GP-Recon consistently generates more accurate and complete reconstructions with much better fine-grained details, both quantitatively and qualitatively.
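
For reference, a minimal numpy sketch of the discrete volume rendering integral that neural implicit reconstruction methods of this kind optimize along each ray; the sample spacing and density parameterization are generic, not GP-Recon's specific formulation.

import numpy as np

def render_ray(sigmas, colors, deltas):
    """sigmas: (S,) densities, colors: (S, 3), deltas: (S,) sample spacings."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # transmittance
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)                   # composited RGB

rgb = render_ray(np.random.rand(64), np.random.rand(64, 3), np.full(64, 0.01))
print(rgb)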

4.
IEEE Trans Image Process; 32: 4046-4058, 2023.
Article in English | MEDLINE | ID: mdl-37440403

ABSTRACT

We present Skeleton-CutMix, a simple and effective skeleton augmentation framework for supervised domain adaptation, and show its advantage in skeleton-based action recognition tasks. Existing approaches usually perform domain adaptation for action recognition with elaborate loss functions that aim to achieve domain alignment. However, they fail to capture the intrinsic characteristics of skeleton representation. Benefiting from the well-defined correspondence between bones of a pair of skeletons, we instead mitigate domain shift by fabricating skeleton data in a mixed domain, which mixes up bones from the source domain and the target domain. The fabricated skeletons in the mixed domain can be used to augment training data and train a more general and robust model for action recognition. Specifically, we hallucinate new skeletons by using pairs of skeletons from the source and target domains; a new skeleton is generated by exchanging some bones from the skeleton in the source domain with corresponding bones from the skeleton in the target domain, which resembles a cut-and-mix operation. When exchanging bones from different domains, we introduce a class-specific bone sampling strategy so that bones that are more important for an action class are exchanged with higher probability when generating augmentation samples for that class. We show experimentally that the simple bone exchange strategy for augmentation is efficient and effective, and that distinctive motion features are preserved while mixing both action and style across domains. We validate our method in cross-dataset and cross-age settings on the NTU-60 and ETRI-Activity3D datasets with an average gain of over 3% in action recognition accuracy, and demonstrate its superior performance over previous domain adaptation approaches as well as other skeleton augmentation strategies.


Subjects
Skeleton, Motion (Physics)
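
For intuition, a toy numpy sketch of the bone-exchange augmentation described in this record: bones of a source skeleton are swapped with the corresponding target bones according to class-specific probabilities, and the mixed skeleton is re-assembled from the root. The skeleton layout and sampling weights are assumptions for the example, not the paper's exact configuration.

import numpy as np

# Bones as (parent_joint, child_joint) pairs of a toy 5-joint skeleton, in
# topological order from the root.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def bone_vectors(joints):
    """joints: (T, J, 3) joint positions -> (T, len(BONES), 3) bone vectors."""
    return np.stack([joints[:, c] - joints[:, p] for p, c in BONES], axis=1)

def skeleton_cutmix(src_joints, tgt_joints, bone_probs, rng=np.random):
    """Exchange some source bones with the corresponding target bones.
    bone_probs[i] is the class-specific probability of swapping bone i."""
    src_b, tgt_b = bone_vectors(src_joints), bone_vectors(tgt_joints)
    swap = rng.random(len(BONES)) < bone_probs            # which bones to exchange
    mixed_b = np.where(swap[None, :, None], tgt_b, src_b)
    # Re-assemble joints from the root outwards with the mixed bone vectors.
    out = np.zeros_like(src_joints)
    out[:, 0] = src_joints[:, 0]                          # keep the source root
    for i, (p, c) in enumerate(BONES):
        out[:, c] = out[:, p] + mixed_b[:, i]
    return out

src = np.random.rand(16, 5, 3)   # 16 frames, 5 joints
tgt = np.random.rand(16, 5, 3)
mixed = skeleton_cutmix(src, tgt, bone_probs=np.array([0.8, 0.2, 0.2, 0.5]))
print(mixed.shape)  # (16, 5, 3)
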
5.
IEEE Trans Pattern Anal Mach Intell; 45(5): 5436-5447, 2023 May.
Article in English | MEDLINE | ID: mdl-36197869

ABSTRACT

Attention mechanisms, especially self-attention, have played an increasingly important role in deep feature representation for visual tasks. Self-attention updates the feature at each position by computing a weighted sum of features, using pair-wise affinities across all positions to capture the long-range dependency within a single sample. However, self-attention has quadratic complexity and ignores potential correlation between different samples. This article proposes a novel attention mechanism, which we call external attention, based on two external, small, learnable, shared memories; it can be implemented easily using two cascaded linear layers and two normalization layers, and it conveniently replaces self-attention in existing popular architectures. External attention has linear complexity and implicitly considers the correlations between all data samples. We further incorporate the multi-head mechanism into external attention to provide an all-MLP architecture, external attention MLP (EAMLP), for image classification. Extensive experiments on image classification, object detection, semantic segmentation, instance segmentation, image generation, and point cloud analysis reveal that our method provides results comparable or superior to the self-attention mechanism and some of its variants, with much lower computational and memory costs.
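
A sketch of the external-attention operator as described above: two small learnable memories implemented as cascaded linear layers, followed by two normalization steps (here a softmax over positions and an L1-style renormalization). The memory size is illustrative; see the paper for the exact EAMLP architecture.

import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    def __init__(self, d_model, memory_size=64):
        super().__init__()
        self.mk = nn.Linear(d_model, memory_size, bias=False)  # key memory
        self.mv = nn.Linear(memory_size, d_model, bias=False)  # value memory

    def forward(self, x):                 # x: (B, N, d_model)
        attn = self.mk(x)                 # (B, N, S): affinity to memory units
        attn = attn.softmax(dim=1)        # normalize over the N positions
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)   # second normalization
        return self.mv(attn)              # (B, N, d_model), linear in N

ea = ExternalAttention(d_model=128)
y = ea(torch.rand(2, 1024, 128))
print(y.shape)  # torch.Size([2, 1024, 128])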

6.
IEEE Trans Image Process; 32: 6401-6412, 2023.
Article in English | MEDLINE | ID: mdl-37976196

ABSTRACT

This paper presents a Semantic Positioning System (SPS) to enhance the accuracy of mobile device geo-localization in outdoor urban environments. Although the traditional Global Positioning System (GPS) can offer a rough localization, it lacks the accuracy needed for applications such as Augmented Reality (AR). Our SPS integrates Geographic Information System (GIS) data, GPS signals, and visual image information to estimate the 6 Degree-of-Freedom (DoF) pose through cross-view semantic matching. This approach scales well to GIS contexts with different Levels of Detail (LOD). The map data representation is a Digital Elevation Model (DEM), a cost-effective aerial map that allows fast deployment over large-scale areas. However, the DEM lacks geometric and texture details, making it challenging for traditional visual feature extraction to establish pixel/voxel-level cross-view correspondences. To address this, we sample observation pixels from the query ground-view image using predicted semantic labels. We then propose an iterative homography estimation method with semantic correspondences. To improve the efficiency of the overall system, we further employ a heuristic search to speed up the matching process. The proposed method is robust, real-time, and automatic. Quantitative experiments on the challenging Bund dataset show that we achieve a positioning accuracy of 73.24%, surpassing the baseline skyline-based method by 20%. Compared with the state-of-the-art semantic-based approach on the KITTI dataset, we improve the positioning accuracy by an average of 5%.
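
A rough sketch of iterative homography estimation from matched semantic samples, using OpenCV's RANSAC estimator and re-fitting on the inlier set. The refinement loop, thresholds, and correspondence source are assumptions for illustration, not the authors' pipeline.

import numpy as np
import cv2

def iterative_homography(query_pts, map_pts, iters=3, thresh=3.0):
    """query_pts, map_pts: (N, 2) pixel coordinates of matched semantic samples."""
    H, mask = cv2.findHomography(query_pts, map_pts, cv2.RANSAC, thresh)
    for _ in range(iters - 1):
        inliers = mask.ravel().astype(bool)
        if inliers.sum() < 4:
            break
        query_pts, map_pts = query_pts[inliers], map_pts[inliers]
        # Re-fit on the current inlier set to tighten the estimate.
        H, mask = cv2.findHomography(query_pts, map_pts, cv2.RANSAC, thresh)
    return H

# Toy usage with synthetic correspondences.
pts_a = (np.random.rand(30, 2) * 100).astype(np.float32)
H_true = np.array([[1, 0.02, 5], [0.01, 1, -3], [0, 0, 1]], np.float32)
pts_b = cv2.perspectiveTransform(pts_a[:, None, :], H_true)[:, 0, :]
print(iterative_homography(pts_a, pts_b).round(2))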

7.
IEEE Trans Vis Comput Graph; 29(12): 5523-5537, 2023 Dec.
Article in English | MEDLINE | ID: mdl-36251891

ABSTRACT

Selecting views is one of the most common but overlooked procedures in topics related to 3D scenes. Typically, existing applications and researchers manually select views through a trial-and-error process or "preset" a direction, such as a top-down view. For example, the scene synthesis literature requires views for visualizing scenes, and research on panoramas and VR also requires initial camera placements. This article presents SceneViewer, an integrated system for automatic view selection. Our system works by applying rules of interior photography, which guide potential views and seek better ones. Through experiments and applications, we show the potential and novelty of the proposed method.

8.
IEEE Trans Vis Comput Graph; 29(12): 5124-5136, 2023 Dec.
Article in English | MEDLINE | ID: mdl-36194712

ABSTRACT

View synthesis methods using implicit continuous shape representations learned from a set of images, such as the Neural Radiance Field (NeRF) method, have gained increasing attention due to their high-quality imagery and scalability to high resolution. However, the heavy computation required by its volumetric approach prevents NeRF from being useful in practice; it takes minutes to render a single image of a few megapixels. Since an image of a scene can be rendered in a level-of-detail manner, we posit that a complicated region of the scene should be represented by a large neural network while a small neural network suffices for a simple region, enabling a balance between efficiency and quality. Recursive-NeRF is our embodiment of this idea, providing an efficient and adaptive rendering and training approach for NeRF. The core of Recursive-NeRF learns uncertainties for query coordinates, representing the quality of the predicted color and volumetric intensity at each level. Only query coordinates with high uncertainties are forwarded to the next level, to a bigger neural network with more powerful representational capability. The final rendered image is a composition of results from neural networks of all levels. Our evaluation on public datasets and a large-scale scene dataset we collected shows that Recursive-NeRF is more efficient than NeRF while providing state-of-the-art quality. The code will be available at https://github.com/Gword/Recursive-NeRF.
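
A schematic PyTorch sketch of the uncertainty-gated forwarding described above: each level predicts color, density, and an uncertainty, and only high-uncertainty query coordinates continue to the next, larger network. Network widths, the threshold, and the output parameterization are assumptions, not the released code.

import torch
import torch.nn as nn

def make_level(width):
    return nn.Sequential(nn.Linear(3, width), nn.ReLU(),
                         nn.Linear(width, 5))   # RGB + density + uncertainty

levels = nn.ModuleList([make_level(w) for w in (64, 128, 256)])

def recursive_query(x, threshold=0.1):
    """x: (N, 3) query coordinates -> (N, 4) color+density, composited per level."""
    out = torch.zeros(x.shape[0], 4)
    active = torch.ones(x.shape[0], dtype=torch.bool)
    for i, net in enumerate(levels):
        pred = net(x[active])
        rgb_sigma, unc = pred[:, :4], pred[:, 4].sigmoid()
        confident = unc < threshold
        if i == len(levels) - 1:               # last level handles everything left
            confident[:] = True
        idx = active.nonzero(as_tuple=True)[0]
        out[idx[confident]] = rgb_sigma[confident]   # finalize confident queries
        new_active = torch.zeros_like(active)
        new_active[idx[~confident]] = True           # forward the uncertain ones
        active = new_active
        if not active.any():
            break
    return out

print(recursive_query(torch.rand(1000, 3)).shape)  # torch.Size([1000, 4])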

9.
Article in English | MEDLINE | ID: mdl-37028344

ABSTRACT

Deep neural networks (DNNs) have been widely used for mesh processing in recent years. However, current DNNs cannot process arbitrary meshes efficiently. On the one hand, most DNNs expect 2-manifold, watertight meshes, but many meshes, whether manually designed or automatically generated, may have gaps, non-manifold geometry, or other defects. On the other hand, the irregular structure of meshes also makes it difficult to build hierarchical structures and aggregate local geometric information, which is critical for DNNs. In this paper, we present DGNet, an efficient, effective, and generic deep neural mesh processing network based on dual graph pyramids that can handle arbitrary meshes. Firstly, we construct dual graph pyramids for meshes to guide feature propagation between hierarchical levels during both downsampling and upsampling. Secondly, we propose a novel convolution to aggregate local features on the proposed hierarchical graphs. By utilizing both geodesic neighbors and Euclidean neighbors, the network enables feature aggregation both within local surface patches and between isolated mesh components. Experimental results demonstrate that DGNet can be applied to both shape analysis and large-scale scene understanding. Furthermore, it achieves superior performance on various benchmarks, including ShapeNetCore, HumanBody, ScanNet and Matterport3D. Code and models will be available at https://github.com/li-xl/DGNet.
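
A toy PyTorch sketch of aggregating features over both geodesic (connectivity-based) and Euclidean (spatial) neighbors, which is the intuition behind the dual-graph aggregation above; graph construction and the learned convolution are heavily simplified, and all names are illustrative.

import torch
import torch.nn as nn

class DualNeighborConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin_geo = nn.Linear(in_dim, out_dim)    # geodesic branch
        self.lin_euc = nn.Linear(in_dim, out_dim)    # Euclidean branch
        self.lin_self = nn.Linear(in_dim, out_dim)

    def forward(self, feats, geo_nbrs, euc_nbrs):
        # feats: (N, C); *_nbrs: (N, K) neighbor indices.
        geo = self.lin_geo(feats[geo_nbrs].mean(dim=1))   # mean over K geodesic neighbors
        euc = self.lin_euc(feats[euc_nbrs].mean(dim=1))   # mean over K spatial neighbors
        return torch.relu(self.lin_self(feats) + geo + euc)

N, K, C = 100, 8, 16
conv = DualNeighborConv(C, 32)
feats = torch.rand(N, C)
geo_nbrs = torch.randint(0, N, (N, K))   # stand-in for mesh-connectivity neighbors
euc_nbrs = torch.randint(0, N, (N, K))   # stand-in for kNN in Euclidean space
print(conv(feats, geo_nbrs, euc_nbrs).shape)  # torch.Size([100, 32])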

10.
IEEE Trans Image Process; 32: 6413-6425, 2023.
Article in English | MEDLINE | ID: mdl-37906473

ABSTRACT

Objects in aerial images show greater variations in scale and orientation than in other images, making them harder to detect using vanilla deep convolutional neural networks. Networks with sampling equivariance can adapt their sampling from input feature maps to object transformations, allowing a convolutional kernel to extract effective object features under different transformations. However, methods such as deformable convolutional networks can only provide sampling equivariance under certain circumstances, because they sample by location. We propose sampling equivariant self-attention networks, which treat self-attention restricted to a local image patch as convolution sampling by masks instead of locations, together with a transformation embedding module to further improve equivariant sampling. We further propose a novel randomized normalization module to enhance network generalization and a quantitative metric to fairly evaluate the sampling equivariance of different models. Experiments show that our model provides significantly better sampling equivariance than existing methods without additional supervision and can thus extract more effective image features. Our model achieves state-of-the-art results on the DOTA-v1.0, DOTA-v1.5, and HRSC2016 datasets without additional computations or parameters.
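
A simplified PyTorch sketch of self-attention restricted to a local patch, where the attention weights act as a per-location sampling mask over the window rather than sampling by offsets. The window size and projections are illustrative assumptions, not the paper's exact module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalPatchAttention(nn.Module):
    def __init__(self, channels, window=3):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.window = window

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        w, pad = self.window, self.window // 2
        q = self.q(x).reshape(B, C, H * W)                    # one query per location
        k = F.unfold(self.k(x), w, padding=pad)               # (B, C*w*w, H*W)
        v = F.unfold(self.v(x), w, padding=pad)
        k = k.reshape(B, C, w * w, H * W)
        v = v.reshape(B, C, w * w, H * W)
        attn = (q.unsqueeze(2) * k).sum(1) / C ** 0.5          # (B, w*w, H*W)
        attn = attn.softmax(dim=1)                             # sampling mask over the patch
        out = (v * attn.unsqueeze(1)).sum(2)                   # (B, C, H*W)
        return out.reshape(B, C, H, W)

m = LocalPatchAttention(16)
print(m(torch.rand(2, 16, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])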

11.
IEEE Trans Vis Comput Graph; 28(4): 1745-1757, 2022 Apr.
Article in English | MEDLINE | ID: mdl-33001804

ABSTRACT

Accurate camera pose estimation is essential and challenging for real-world dynamic 3D reconstruction and augmented reality applications. In this article, we present a novel RGB-D SLAM approach for accurate camera pose tracking in dynamic environments. Previous methods detect dynamic components only across a short time span of consecutive frames. Instead, we provide a more accurate dynamic 3D landmark detection method, followed by the use of long-term consistency via conditional random fields, which leverages long-term observations from multiple frames. Specifically, we first introduce an efficient initial camera pose estimation method based on distinguishing dynamic from static points using graph-cut RANSAC. These static/dynamic labels are used as priors for the unary potential in the conditional random fields, which further improves the accuracy of dynamic 3D landmark detection. Evaluation using the TUM and Bonn RGB-D dynamic datasets shows that our approach significantly outperforms state-of-the-art methods, providing much more accurate camera trajectory estimation in a variety of highly dynamic environments. We also show that dynamic 3D reconstruction can benefit from the camera poses estimated by our RGB-D SLAM approach.
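
A simplified sketch of the static/dynamic labelling step: robustly fit the dominant inter-frame motion and treat inliers as "static" priors for the downstream CRF. The paper uses graph-cut RANSAC on the camera motion model; OpenCV's RANSAC-based affine estimator is substituted here purely for illustration, and the toy data and threshold are assumptions.

import numpy as np
import cv2

def static_dynamic_prior(pts_prev, pts_curr, thresh=2.0):
    """pts_prev, pts_curr: (N, 2) matched keypoints across two frames.
    Returns True where a point follows the dominant (camera-induced) motion,
    i.e. is likely static."""
    _, inliers = cv2.estimateAffine2D(pts_prev, pts_curr,
                                      method=cv2.RANSAC,
                                      ransacReprojThreshold=thresh)
    return inliers.ravel().astype(bool)

# Toy usage: most points follow one dominant motion, a few move independently.
pts_prev = np.random.rand(100, 2) * 400 + 100
pts_curr = pts_prev + np.array([2.0, 1.0])            # dominant (camera) motion
pts_curr[:10] += np.random.rand(10, 2) * 50            # a few dynamic outliers
static = static_dynamic_prior(pts_prev, pts_curr)
print(static.sum(), "of", len(static), "points labelled static")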
