Results 1 - 10 of 10
1.
IEEE Trans Pattern Anal Mach Intell ; 46(6): 4381-4397, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38227416

ABSTRACT

Nowadays, pre-training big models on large-scale datasets has achieved great success and dominated many downstream tasks in natural language processing and 2D vision, while pre-training in 3D vision is still under development. In this paper, we provide a new perspective on transferring pre-trained knowledge from the 2D domain to the 3D domain with Point-to-Pixel Prompting in data space and Pixel-to-Point distillation in feature space, exploiting the knowledge shared between images and point clouds that depict the same visual world. Following the principle of prompt engineering, Point-to-Pixel Prompting transforms point clouds into colorful images with geometry-preserved projection and geometry-aware coloring. Pre-trained image models can then be applied directly to point cloud tasks without structural changes or weight modifications. Using the projection correspondence in feature space, Pixel-to-Point distillation further treats pre-trained image models as teachers and distills pre-trained 2D knowledge into student point cloud models, remarkably enhancing inference efficiency and model capacity for point cloud analysis. We conduct extensive experiments on both object classification and scene segmentation under various settings to demonstrate the superiority of our method. In object classification, we reveal an important scaling trend of Point-to-Pixel Prompting and attain 90.3% accuracy on the ScanObjectNN dataset, surpassing previous methods by a large margin. In scene-level semantic segmentation, our method outperforms traditional 3D analysis approaches and shows competitive capacity in dense prediction tasks.
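To make the projection step concrete, below is a minimal NumPy sketch of the general Point-to-Pixel idea: orthographically project a point cloud onto a pixel grid and color each occupied pixel from point geometry (here, normalized depth). The resolution, projection axis, z-buffering, and coloring scheme are illustrative assumptions, not the authors' geometry-preserved projection and geometry-aware coloring.

```python
import numpy as np

def point_to_pixel(points, img_size=224):
    """Project an (N, 3) point cloud to an (img_size, img_size, 3) image.

    Orthographic projection onto the XY plane; pixels are colored by
    normalized depth (z). Illustrative stand-in for Point-to-Pixel Prompting.
    """
    pts = points - points.mean(axis=0)                       # center the cloud
    pts = pts / (np.abs(pts).max() + 1e-8)                   # scale to [-1, 1]
    u = ((pts[:, 0] + 1) / 2 * (img_size - 1)).astype(int)   # x -> column
    v = ((pts[:, 1] + 1) / 2 * (img_size - 1)).astype(int)   # y -> row
    depth = (pts[:, 2] + 1) / 2                              # z in [0, 1]

    img = np.zeros((img_size, img_size, 3), dtype=np.float32)
    zbuf = np.full((img_size, img_size), -np.inf)
    for ui, vi, di in zip(u, v, depth):
        if di > zbuf[vi, ui]:                                # simple z-buffer
            zbuf[vi, ui] = di
            img[vi, ui] = di                                 # grayscale depth coloring
    return img                                               # feed to a frozen 2D model

# Usage: img = point_to_pixel(np.random.rand(2048, 3))
```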

2.
IEEE Trans Pattern Anal Mach Intell ; 46(4): 2518-2532, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38019629

ABSTRACT

In this paper, we present a new framework named DIML to achieve more interpretable deep metric learning. Unlike traditional deep metric learning methods that simply produce a global similarity given two images, DIML computes the overall similarity as a weighted sum of multiple local part-wise similarities, making it easier for humans to understand how the model distinguishes between two images. Specifically, we propose a structural matching strategy that explicitly aligns the spatial embeddings by computing an optimal matching flow between the feature maps of the two images. We also devise a multi-scale matching strategy, which considers both global and local similarities and can significantly reduce the computational cost in image retrieval applications. To handle view variance in complicated scenarios, we propose to use cross-correlation as the marginal distribution of the optimal transport, leveraging semantic information to locate the important regions in the images. Our framework is model-agnostic and can be applied to off-the-shelf backbone networks and metric learning methods. To extend DIML to more advanced architectures such as vision Transformers (ViTs), we further propose truncated attention rollout and partial similarity to overcome the lack of locality in ViTs. We evaluate our method on three major deep metric learning benchmarks, CUB200-2011, Cars196, and Stanford Online Products, and achieve substantial improvements over popular metric learning methods with better interpretability.
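The following sketch illustrates the core idea of an optimal-transport-weighted sum of part-wise similarities: local cosine similarities between two feature maps are aggregated with a transport plan computed by plain Sinkhorn iterations. The uniform marginals, entropy weight, and iteration count are assumptions for illustration; the paper instead uses cross-correlation marginals and a multi-scale scheme.

```python
import torch
import torch.nn.functional as F

def diml_similarity(feat_a, feat_b, n_iter=20, eps=0.05):
    """Overall similarity as an optimal-transport-weighted sum of local
    cosine similarities between two (C, H, W) feature maps (Sinkhorn sketch).
    """
    C, H, W = feat_a.shape
    a = F.normalize(feat_a.reshape(C, H * W), dim=0)
    b = F.normalize(feat_b.reshape(C, H * W), dim=0)
    sim = a.t() @ b                                  # (HW, HW) local cosine similarities
    K = torch.exp(sim / eps)                         # Gibbs kernel
    mu = torch.full((H * W,), 1.0 / (H * W))         # uniform marginals (assumption)
    nu = torch.full((H * W,), 1.0 / (H * W))
    u = torch.ones_like(mu)
    for _ in range(n_iter):                          # Sinkhorn iterations
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    T = torch.diag(u) @ K @ torch.diag(v)            # transport plan (matching flow)
    return (T * sim).sum()                           # weighted sum of part similarities
```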

3.
IEEE Trans Pattern Anal Mach Intell ; 45(12): 14114-14130, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37924200

ABSTRACT

In this paper, we propose a Transformer encoder-decoder architecture, called PoinTr, which reformulates point cloud completion as a set-to-set translation problem and employs a geometry-aware block to explicitly model local geometric relationships. Adopting Transformers enables our model to better learn structural knowledge and preserve detailed information for point cloud completion. Taking a step towards more complicated and diverse situations, we further propose AdaPoinTr by developing an adaptive query generation mechanism and designing a novel denoising task during point cloud completion. Coupling these two techniques enables us to train the model efficiently and effectively: we reduce training time (by 15x or more) and improve completion performance (by over 20%). Additionally, we propose two more challenging benchmarks with more diverse incomplete point clouds that better reflect real-world scenarios to promote future research. We also show that our method can be extended to scene-level point cloud completion by designing a new geometry-enhanced semantic scene completion framework. Extensive experiments on existing and newly proposed datasets demonstrate the effectiveness of our method, which attains 6.53 CD on PCN, 0.81 CD on ShapeNet-55, and 0.392 MMD on real-world KITTI, surpassing other work by a large margin and establishing a new state of the art on various benchmarks. Most notably, AdaPoinTr achieves this performance with higher throughput and fewer FLOPs than the previous best methods in practice.
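For readers unfamiliar with the CD numbers quoted above, here is a minimal sketch of a symmetric Chamfer Distance between two point clouds. This is a plain L2 version for illustration only; the reported benchmark values depend on the exact CD variant (L1 vs. L2) and scaling conventions used by each dataset.

```python
import torch

def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance between point clouds of shape (N, 3) and (M, 3)."""
    d = torch.cdist(pred, gt)                                  # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# Usage: cd = chamfer_distance(torch.rand(2048, 3), torch.rand(2048, 3))
```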

4.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 13621-13635, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37432799

ABSTRACT

In this paper, we propose Point-Voxel Correlation Fields to explore the relations between two consecutive point clouds and estimate scene flow that represents 3D motion. Most existing works only consider local correlations, which can handle small movements but fail when there are large displacements. Therefore, it is essential to introduce all-pair correlation volumes that are free from local neighbor restrictions and cover both short- and long-term dependencies. However, it is challenging to efficiently extract correlation features from all-pair fields in 3D space, given the irregular and unordered nature of point clouds. To tackle this problem, we present point-voxel correlation fields with distinct point and voxel branches that query local and long-range correlations from all-pair fields, respectively. To exploit point-based correlations, we adopt a K-Nearest Neighbors search that preserves fine-grained information in the local region and maintains the precision of scene flow estimation. By voxelizing point clouds in a multi-scale manner, we construct pyramid correlation voxels to model long-range correspondences, which are used to handle fast-moving objects. Integrating these two types of correlations, we propose the Point-Voxel Recurrent All-Pairs Field Transforms (PV-RAFT) architecture, which employs an iterative scheme to estimate scene flow from point clouds. To adapt to different flow scopes and obtain more fine-grained results, we further propose Deformable PV-RAFT (DPV-RAFT), where Spatial Deformation deforms the voxelized neighborhood and Temporal Deformation controls the iterative update process. We evaluate the proposed method on the FlyingThings3D and KITTI Scene Flow 2015 datasets, and experimental results show that we outperform state-of-the-art methods by remarkable margins.
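A minimal sketch of the point-branch lookup described above: given a precomputed all-pair correlation volume, each flow-translated query point gathers the correlations of its K nearest neighbors in the second frame. The neighborhood size and the shapes assumed here are illustrative, and the surrounding recurrent update network is omitted.

```python
import torch

def knn_point_correlation(corr, pts2, query, k=16):
    """Point-branch correlation lookup (illustrative sketch).

    corr:  (N1, N2) all-pair correlation volume between the two frames.
    pts2:  (N2, 3) points of the second frame.
    query: (N1, 3) flow-translated positions of the first frame's points.
    Returns (N1, k) correlations of each query's k nearest neighbors in
    frame 2, which a network head would turn into a flow update.
    """
    dist = torch.cdist(query, pts2)                   # (N1, N2) distances
    knn_idx = dist.topk(k, largest=False).indices     # (N1, k) nearest-neighbor indices
    return torch.gather(corr, 1, knn_idx)             # local correlation features
```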

5.
IEEE Trans Pattern Anal Mach Intell ; 45(9): 10960-10973, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37030707

ABSTRACT

Recent advances in self-attention and pure multi-layer perceptron (MLP) models for vision have shown great potential in achieving promising performance with fewer inductive biases. These models are generally based on learning interactions among spatial locations from raw data. The complexity of self-attention and MLPs grows quadratically as the image size increases, which makes these models hard to scale up when high-resolution features are required. In this paper, we present the Global Filter Network (GFNet), a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity. Our architecture replaces the self-attention layer in vision Transformers with three key operations: a 2D discrete Fourier transform, an element-wise multiplication between frequency-domain features and learnable global filters, and a 2D inverse Fourier transform. Based on this basic design, we develop a series of isotropic models with a Transformer-style simple architecture and CNN-style hierarchical models with better performance. Isotropic GFNet models exhibit favorable accuracy/complexity trade-offs compared to recent vision Transformers and pure MLP models. Hierarchical GFNet models can inherit successful designs from CNNs and be easily scaled up with larger model sizes and more training data, showing strong performance on both image classification (e.g., 85.0% top-1 accuracy on ImageNet-1K without any extra data or supervision, and 87.4% accuracy with ImageNet-21K pre-training) and dense prediction tasks (e.g., 54.3 mIoU on the ADE20K val set). Our results demonstrate that GFNet can be a very competitive alternative to Transformer-based models and CNNs in terms of efficiency, generalization ability, and robustness. Code is available at https://github.com/raoyongming/GFNet.
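The three operations listed above translate almost directly into code. Below is a minimal sketch of a global filter layer, a frequency-domain replacement for spatial self-attention: 2D FFT, element-wise multiplication with a learnable complex filter, and 2D inverse FFT. The token-grid shape, filter initialization, and layout are illustrative assumptions rather than the released GFNet implementation.

```python
import torch
import torch.nn as nn

class GlobalFilter(nn.Module):
    """Frequency-domain mixing layer: rFFT2 -> learnable filter -> irFFT2."""
    def __init__(self, dim, h=14, w=8):
        super().__init__()
        # complex filter stored as (real, imag); w = h // 2 + 1 to match rfft2
        self.filter = nn.Parameter(torch.randn(h, w, dim, 2) * 0.02)

    def forward(self, x):                              # x: (B, H, W, C) token grid
        X = torch.fft.rfft2(x, dim=(1, 2), norm='ortho')
        X = X * torch.view_as_complex(self.filter)     # global mixing in frequency space
        return torch.fft.irfft2(X, s=x.shape[1:3], dim=(1, 2), norm='ortho')
```

Because the filter acts on all frequencies at once, every output token depends on every input token, which is how the layer captures global interactions at FFT (log-linear) cost.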

6.
IEEE Trans Pattern Anal Mach Intell ; 45(9): 10883-10897, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37030709

ABSTRACT

In this paper, we present a new approach for model acceleration that exploits spatial sparsity in visual data. We observe that the final prediction in vision Transformers is based on only a subset of the most informative regions, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework that prunes redundant tokens progressively and dynamically based on the input to accelerate vision Transformers. Specifically, we devise a lightweight prediction module to estimate the importance of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. While the framework is inspired by our observation of sparse attention in vision Transformers, we find that the idea of adaptive and asymmetric computation can be a general solution for accelerating various architectures. We extend our method to hierarchical models, including CNNs and hierarchical vision Transformers, as well as more complex dense prediction tasks. To handle structured feature maps, we formulate a generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation for different spatial locations. By applying lightweight fast paths to less informative features and expressive slow paths to important locations, we can maintain the complete structure of feature maps while significantly reducing the overall computation. Extensive experiments on diverse modern architectures and different visual tasks demonstrate the effectiveness of our proposed framework. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-35% and improves throughput by over 40%, while the drop in accuracy is within 0.5% for various vision Transformers. By introducing asymmetric computation, a similar acceleration can be achieved on modern CNNs and Swin Transformers. Moreover, our method achieves promising results on more complex tasks, including semantic segmentation and object detection. Our results clearly demonstrate that dynamic spatial sparsification offers a new and more effective dimension for model acceleration. Code is available at https://github.com/raoyongming/DynamicViT.
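A minimal sketch of the token sparsification step: a lightweight scoring head predicts per-token importance and only the top-scoring fraction of tokens is kept for later layers. This inference-style sketch drops tokens with a hard top-k; the keep ratio and score head are illustrative, and DynamicViT itself keeps pruning differentiable during training (e.g., with a soft mask) and handles the class token separately.

```python
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    """Score tokens with a tiny head and keep the top keep_ratio fraction."""
    def __init__(self, dim, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens):                        # tokens: (B, N, C)
        B, N, C = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        s = self.score(tokens).squeeze(-1)            # (B, N) importance scores
        idx = s.topk(k, dim=1).indices                # indices of kept tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, C)     # (B, k, C) gather index
        return torch.gather(tokens, 1, idx)           # pruned token sequence
```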

7.
IEEE Trans Pattern Anal Mach Intell ; 45(2): 2193-2207, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35294344

ABSTRACT

This work explores the use of the global and local structures of 3D point clouds as a free and powerful supervision signal for representation learning. The local and global patterns of a 3D object are closely related. Although each part of an object is incomplete, the underlying attributes of the object are shared among all parts, which makes reasoning about the whole object from a single part possible. We hypothesize that a powerful representation of a 3D object should model the attributes that are shared between the parts and the whole object, and be distinguishable from those of other objects. Based on this hypothesis, we propose a new framework to learn point cloud representations by bidirectional reasoning between the local structures at different abstraction hierarchies and the global shape. Moreover, we extend this unsupervised structural representation learning method to more complex 3D scenes. By introducing structural proxies as intermediate-level representations between local and global ones, we propose a hierarchical reasoning scheme among local parts, structural proxies, and the overall point cloud to learn powerful 3D representations in an unsupervised manner. Extensive experimental results demonstrate that the unsupervised representations can be very competitive alternatives to supervised representations in discriminative power, and exhibit better generalization ability and robustness. Our method establishes a new state of the art for unsupervised/few-shot 3D object classification and part segmentation. We also show that our method can serve as a simple yet effective pre-training scheme for 3D scene segmentation and detection tasks. We expect our observations to offer a new perspective on learning better representations from data structures instead of human annotations for point cloud understanding.
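To make the local-global supervision signal concrete, here is one plausible instantiation as a contrastive objective: each part embedding should identify the global embedding of its own object among the other objects in a batch. This InfoNCE-style sketch is an assumption for illustration, not the paper's exact bidirectional reasoning loss, and the part/global encoders that produce the embeddings are omitted.

```python
import torch
import torch.nn.functional as F

def local_global_contrastive(part_emb, global_emb, tau=0.07):
    """InfoNCE-style sketch of local-global structural supervision.

    part_emb:   (B, P, D) embeddings of P local parts per object.
    global_emb: (B, D) whole-object embeddings.
    """
    B, P, D = part_emb.shape
    parts = F.normalize(part_emb.reshape(B * P, D), dim=1)
    glob = F.normalize(global_emb, dim=1)              # (B, D)
    logits = parts @ glob.t() / tau                    # (B*P, B) similarities
    targets = torch.arange(B).repeat_interleave(P)     # each part's own object
    return F.cross_entropy(logits, targets)
```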

8.
IEEE Trans Image Process ; 31: 6048-6061, 2022.
Article in English | MEDLINE | ID: mdl-36103440

ABSTRACT

In this paper, we investigate the problem of abductive visual reasoning (AVR), which requires vision systems to infer the most plausible explanation for visual observations. Unlike previous work that performs visual reasoning on static images or synthesized scenes, we exploit long-term reasoning from instructional videos, which contain a wealth of detailed information about the physical world. We formulate two tasks for this emerging and challenging topic. The primary task is AVR: given the initial configuration and desired goal from an instructional video, the model is expected to infer the most plausible sequence of steps to achieve the goal. To avoid trivial solutions based on appearance information rather than reasoning, we construct a second task called AVR++, which requires the model to explain why the unselected options are less plausible. We introduce a new dataset called VideoABC, which consists of 46,354 unique steps derived from 11,827 instructional videos, formulated as 13,526 abductive reasoning questions with an average reasoning duration of 51 seconds. Through an adversarial hard hypothesis mining algorithm, non-trivial and high-quality problems are generated efficiently and effectively. To approach human-level reasoning, we propose a Hierarchical Dual Reasoning Network (HDRNet) to capture the long-term dependencies among steps and observations. We establish a benchmark for abductive visual reasoning; our method sets the state of the art on AVR (∼74%) and AVR++ (∼45%), while humans easily achieve over 90% accuracy on both tasks. This large performance gap reveals the limitations of current video understanding models in temporal reasoning and leaves substantial room for future research on this challenging problem. Our dataset and code are available at https://github.com/wl-zhao/VideoABC.


Subjects
Algorithms, Datasets as Topic, Humans
9.
IEEE Trans Pattern Anal Mach Intell ; 44(11): 7898-7911, 2022 Nov.
Article in English | MEDLINE | ID: mdl-34550879

ABSTRACT

Structures matter in single image super-resolution (SISR). Benefiting from generative adversarial networks (GANs), recent studies have advanced SISR by recovering photo-realistic images. However, there are still undesired structural distortions in the recovered images. In this paper, we propose a structure-preserving super-resolution (SPSR) method to alleviate this issue while maintaining the merits of GAN-based methods for generating perceptually pleasant details. First, we propose SPSR with gradient guidance (SPSR-G), which exploits the gradient maps of images to guide recovery in two ways. On the one hand, we restore high-resolution gradient maps with a gradient branch to provide additional structure priors for the SR process. On the other hand, we propose a gradient loss that imposes a second-order restriction on the super-resolved images, which helps the generative network concentrate more on geometric structures. Second, since gradient maps are handcrafted and may capture only limited aspects of structural information, we further extend SPSR-G by introducing a learnable neural structure extractor (NSE) to unearth richer local structures and provide stronger supervision for SR. We propose two self-supervised structure learning methods, contrastive prediction and solving jigsaw puzzles, to train the NSEs. Our methods are model-agnostic and can be applied to off-the-shelf SR networks. Experimental results on five benchmark datasets show that the proposed methods outperform state-of-the-art perceptual-driven SR methods under the LPIPS, PSNR, and SSIM metrics. Visual results demonstrate the superiority of our methods in restoring structures while generating natural SR images. Code is available at https://github.com/Maclory/SPSR.
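A minimal sketch of the gradient-loss idea: compute gradient-magnitude maps of the super-resolved and ground-truth images with finite differences and penalize their L1 difference. The finite-difference operator and L1 penalty are illustrative assumptions; the paper's full objective also includes the gradient-branch reconstruction and adversarial terms.

```python
import torch
import torch.nn.functional as F

def gradient_map(img):
    """Gradient magnitude of a (B, C, H, W) image via finite differences."""
    dx = img[:, :, :, 1:] - img[:, :, :, :-1]
    dy = img[:, :, 1:, :] - img[:, :, :-1, :]
    dx = F.pad(dx, (0, 1, 0, 0))                      # pad width back to W
    dy = F.pad(dy, (0, 0, 0, 1))                      # pad height back to H
    return torch.sqrt(dx ** 2 + dy ** 2 + 1e-6)

def gradient_loss(sr, hr):
    """Second-order restriction: match gradient maps of SR and HR images."""
    return F.l1_loss(gradient_map(sr), gradient_map(hr))
```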

10.
IEEE Trans Pattern Anal Mach Intell ; 41(10): 2291-2304, 2019 Oct.
Article in English | MEDLINE | ID: mdl-30371355

ABSTRACT

In this paper, we propose a generic Runtime Network Routing (RNR) framework for efficient image classification, which selects an optimal path inside the network. Unlike existing static neural network acceleration methods, our method preserves the full ability of the original large network and conducts dynamic routing at runtime according to the input image and current feature maps. The routing is performed in a bottom-up, layer-by-layer manner: we model it as a Markov decision process and use reinforcement learning for training. The agent estimates the reward of each sub-path and routes each sample accordingly, taking a faster path when the image is easier for the task. Since the ability of the network is fully preserved, the balance point is easily adjustable according to the available resources. We test our method on both multi-path residual networks and incremental convolutional channel pruning, and show that RNR consistently outperforms static methods at the same computational complexity on both the CIFAR and ImageNet datasets. Our method can also be applied to off-the-shelf neural network structures and easily extended to other application scenarios.
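The sketch below illustrates only the runtime routing decision: a tiny gate inspects pooled features and sends each sample through a cheap or an expensive sub-path. The sigmoid gate, fixed threshold, and two-path setup are illustrative assumptions; the paper instead learns layer-by-layer routing with reinforcement learning over a Markov decision process.

```python
import torch
import torch.nn as nn

class RuntimeRouter(nn.Module):
    """Per-sample path selection sketch; fast_path and slow_path are assumed
    to map (B, C, H, W) inputs to (B, num_classes) logits."""
    def __init__(self, channels, num_classes, fast_path, slow_path, threshold=0.5):
        super().__init__()
        self.gate = nn.Linear(channels, 1)             # tiny routing gate
        self.fast, self.slow = fast_path, slow_path
        self.num_classes = num_classes
        self.threshold = threshold

    def forward(self, x):                              # x: (B, C, H, W)
        score = torch.sigmoid(self.gate(x.mean(dim=(2, 3)))).squeeze(1)  # (B,)
        easy = score < self.threshold                  # easy samples -> fast path
        out = x.new_zeros(x.size(0), self.num_classes)
        if easy.any():
            out[easy] = self.fast(x[easy])
        if (~easy).any():
            out[~easy] = self.slow(x[~easy])
        return out
```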
