Results 1 - 8 of 8
1.
Appl Opt ; 55(21): 5539-47, 2016 Jul 20.
Article in English | MEDLINE | ID: mdl-27463902

ABSTRACT

This paper proposes a rapid and automatic measurement system to acquire the 3D shape of a human body. A flexible calibration method was developed to reduce the complexity of system calibration. To reduce the computational cost, a GPU-Steger line detector was proposed to detect the center of the laser pattern more rapidly and at the subpixel level. The processing time of line detection is significantly shortened by the GPU-Steger line detector, which can be over 110 times faster than the CPU implementation. The key technologies are introduced, and experimental results are presented to illustrate the performance of the proposed system. The system can be used to measure human body surfaces with nonuniform reflectance, such as hair, skin, and richly textured clothing.
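As a concrete illustration of the Steger step referenced above, the following is a minimal NumPy/SciPy sketch of subpixel line-center detection on the CPU: Gaussian derivatives give a per-pixel Hessian, the dominant eigenvector gives the direction normal to the line, and a second-order Taylor expansion along that normal yields the subpixel offset. The function name, thresholds, and the `sigma` default are illustrative assumptions; the paper's GPU kernel is not reproduced here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def steger_line_centers(img, sigma=2.0, response_thresh=1.0):
    """Return an (N, 2) array of subpixel (row, col) line centers (CPU sketch)."""
    img = img.astype(np.float64)
    # Gaussian derivatives; order=(d_row, d_col).
    gr  = gaussian_filter(img, sigma, order=(1, 0))
    gc  = gaussian_filter(img, sigma, order=(0, 1))
    grr = gaussian_filter(img, sigma, order=(2, 0))
    gcc = gaussian_filter(img, sigma, order=(0, 2))
    grc = gaussian_filter(img, sigma, order=(1, 1))

    # Per-pixel 2x2 Hessian; its dominant eigenvector is the line normal.
    H = np.stack([grr, grc, grc, gcc], axis=-1).reshape(*img.shape, 2, 2)
    evals, evecs = np.linalg.eigh(H)                      # ascending eigenvalues
    idx = np.argmax(np.abs(evals), axis=-1)               # strongest curvature
    normal = np.take_along_axis(evecs, idx[..., None, None], axis=-1)[..., 0]
    nr, nc = normal[..., 0], normal[..., 1]

    # Second-order Taylor expansion along the normal gives the subpixel offset.
    denom = nr * nr * grr + 2 * nr * nc * grc + nc * nc * gcc
    t = -(gr * nr + gc * nc) / (denom + 1e-12)
    strength = np.max(np.abs(evals), axis=-1)
    valid = (np.abs(t * nr) <= 0.5) & (np.abs(t * nc) <= 0.5) & (strength > response_thresh)

    rows, cols = np.indices(img.shape)
    return np.stack([rows[valid] + (t * nr)[valid],
                     cols[valid] + (t * nc)[valid]], axis=1)
```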

2.
Article in English | MEDLINE | ID: mdl-39321010

ABSTRACT

Referring segmentation is a fundamental vision-language task that aims to segment an object from an image or video according to a natural language description. One of the key challenges behind this task is leveraging the referring expression to highlight relevant positions in the image or video frames. A common paradigm for tackling this problem in both the image and video domains is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advances in this paradigm by exploiting Transformers as cross-modal decoders, concurrent with the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation modeling power of a Transformer encoder to excavate helpful multi-modal context. In this way, accurate segmentation results can be obtained with a lightweight mask predictor. One of the key components in the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. When dealing with video inputs, we present the video LAVT framework and design a 3D version of this component by introducing multi-scale convolutional operators arranged in parallel, which can exploit spatio-temporal dependencies at different granularity levels. We further introduce unified LAVT as a framework capable of handling both image and video inputs, with enhanced segmentation capabilities for the unified referring segmentation task. Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS.
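The dense, pixel-specific language attention described above can be sketched as a small fusion module placed after an encoder stage: each visual token attends over the word embeddings, and the gathered linguistic cue is gated and added back to the visual stream. A minimal PyTorch sketch follows; the module and parameter names are ours and are assumptions rather than the released LAVT code.

```python
import torch
import torch.nn as nn

class PixelWordAttentionFusion(nn.Module):
    """Early vision-language fusion sketch: per-pixel attention over words,
    followed by a gated residual back into the visual feature stream."""

    def __init__(self, vis_dim, lang_dim, hidden_dim=256):
        super().__init__()
        self.q = nn.Linear(vis_dim, hidden_dim)
        self.k = nn.Linear(lang_dim, hidden_dim)
        self.v = nn.Linear(lang_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vis_dim)
        self.gate = nn.Sequential(nn.Linear(vis_dim, vis_dim), nn.Tanh())

    def forward(self, vis_tokens, word_feats, word_mask):
        # vis_tokens: (B, HW, Cv); word_feats: (B, L, Cl); word_mask: (B, L) bool
        q = self.q(vis_tokens)                                   # (B, HW, D)
        k = self.k(word_feats)                                   # (B, L, D)
        v = self.v(word_feats)                                   # (B, L, D)
        attn = torch.matmul(q, k.transpose(1, 2)) / q.shape[-1] ** 0.5
        attn = attn.masked_fill(~word_mask[:, None, :], float('-inf'))
        attn = attn.softmax(dim=-1)                              # pixel-specific word weights
        cue = self.out(torch.matmul(attn, v))                    # (B, HW, Cv)
        # Gated residual: keep the visual stream dominant, inject language.
        return vis_tokens + self.gate(cue) * cue
```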

3.
Article in English | MEDLINE | ID: mdl-38861430

ABSTRACT

In this paper, we formally address universal object detection, which aims to detect every category in every scene. The dependence on human annotations, the limited visual information, and the novel categories in the open world severely restrict the universality of detectors. We propose UniDetector, a universal object detector that recognizes enormous numbers of categories in the open world. The critical points for UniDetector are: 1) it leverages images from multiple sources with heterogeneous label spaces in training through image-text alignment, which guarantees sufficient information for universal representations; 2) it involves heterogeneous supervision in training, which alleviates the dependence on limited fully-labeled images; 3) it generalizes to the open world easily while keeping the balance between seen and unseen classes; 4) it further promotes generalization to novel categories through our proposed decoupling training manner and probability calibration. These contributions allow UniDetector to detect over 7k categories, the largest measurable size so far, with only about 500 classes participating in training. Our UniDetector exhibits strong zero-shot ability on large-vocabulary datasets: it surpasses supervised baselines by more than 5% without seeing any corresponding images. On 13 detection datasets with various scenes, UniDetector also achieves state-of-the-art performance with only 3% of the training data.
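Of the components listed above, the probability calibration lends itself to a short sketch: predictions biased toward frequently predicted (typically seen) categories are divided by an estimated category prior before renormalization. The exponent `gamma` and the way the prior is obtained are assumptions for illustration, not the exact UniDetector recipe.

```python
import numpy as np

def calibrate_scores(scores, prior, gamma=0.6, eps=1e-12):
    """Prior-based probability calibration sketch for open-world detection.

    scores: (N, K) per-box class probabilities
    prior:  (K,)   category frequencies estimated on unlabeled test images
    Returns (N, K) calibrated probabilities that down-weight over-predicted
    (seen-class-biased) categories.
    """
    calibrated = scores / (prior[None, :] + eps) ** gamma
    return calibrated / calibrated.sum(axis=1, keepdims=True)
```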

4.
IEEE Trans Pattern Anal Mach Intell ; 45(7): 9241-9247, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37015401

ABSTRACT

The computational complexity of transformers limits their wide deployment in visual recognition frameworks. Recent work (Dosovitskiy et al., 2021) significantly accelerates network processing by reducing the resolution at the beginning of the network; however, it is still hard to generalize this design directly to other downstream tasks, e.g., object detection and segmentation, as CNNs do. In this paper, we present a transformer-based architecture that retains both local and global interactions within the network and can be transferred to other downstream tasks. The proposed architecture reforms the original full spatial self-attention into pixel-wise local attention and patch-wise global attention. Such factorization saves computational cost while retaining information at different granularities, which helps generate the multi-scale features required by different tasks. By exploiting the factorized attention, we construct a Separable Transformer (SeT) for visual modeling. Experimental results show that SeT outperforms previous state-of-the-art transformer-based approaches and its CNN counterparts on three major tasks: image classification, object detection, and instance segmentation.
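A minimal sketch of the local/global factorization described above: attention is computed pixel-wise inside non-overlapping windows, and then a second attention pass mixes one pooled token per window. The window size, mean pooling, and layer names are illustrative assumptions, not the SeT implementation.

```python
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    """Pixel-wise local attention within windows + patch-wise global attention."""

    def __init__(self, dim, window=7, heads=4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):            # x: (B, H, W, C); H, W divisible by window
        B, H, W, C = x.shape
        w = self.window
        # --- pixel-wise local attention inside each w x w window ---
        win = x.reshape(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(-1, w * w, C)                      # (B*nWin, w*w, C)
        win, _ = self.local_attn(win, win, win)
        # --- patch-wise global attention across one token per window ---
        tokens = win.mean(dim=1).reshape(B, -1, C)
        tokens, _ = self.global_attn(tokens, tokens, tokens)
        # Broadcast the globally mixed token back to its window's pixels.
        win = win + tokens.reshape(-1, 1, C)
        out = win.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)
```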

5.
IEEE Trans Pattern Anal Mach Intell ; 45(7): 8743-8756, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37015515

ABSTRACT

We introduce a new image segmentation task, called Entity Segmentation (ES), which aims to segment all visual entities (objects and stuff) in an image without predicting their semantic labels. By removing the need for class label prediction, models trained for this task can focus more on improving segmentation quality. It has many practical applications, such as image manipulation and editing, where the quality of segmentation masks is crucial but class labels are less important. We conduct the first-ever study to investigate the feasibility of a convolutional center-based representation for segmenting things and stuff in a unified manner, and show that such a representation fits exceptionally well in the context of ES. More specifically, we propose a CondInst-like fully-convolutional architecture with two novel modules specifically designed to exploit the class-agnostic and non-overlapping requirements of ES. Experiments show that models designed and trained for ES significantly outperform popular class-specific panoptic segmentation models in terms of segmentation quality. Moreover, an ES model can be easily trained on a combination of multiple datasets without the need to resolve label conflicts in dataset merging, and a model trained for ES on one or more datasets generalizes very well to other test datasets from unseen domains. The code has been released at https://github.com/dvlab-research/Entity.
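The class-agnostic, non-overlapping requirement mentioned above can be illustrated with a small post-processing sketch: confidence-weighted per-entity masks compete in a per-pixel argmax so that each pixel is assigned to at most one entity. The thresholds and the weighting scheme are assumptions, not the modules from the released code.

```python
import torch

def resolve_non_overlapping(mask_logits, scores, score_thresh=0.3):
    """Assign every pixel to at most one entity (class-agnostic).

    mask_logits: (N, H, W) per-entity mask logits
    scores:      (N,)      per-entity confidence
    Returns an (H, W) map of entity indices, with -1 for background.
    """
    keep = scores > score_thresh
    logits, scores = mask_logits[keep], scores[keep]
    if logits.numel() == 0:
        return torch.full(mask_logits.shape[1:], -1, dtype=torch.long)
    prob = logits.sigmoid() * scores[:, None, None]      # confidence-weighted masks
    best_prob, best_id = prob.max(dim=0)                 # per-pixel winner
    return torch.where(best_prob > 0.5, best_id, torch.full_like(best_id, -1))
```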

6.
IEEE Trans Pattern Anal Mach Intell ; 45(2): 1372-1387, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35294341

ABSTRACT

Strong semantic segmentation models require large backbones to achieve promising performance, making them hard to adapt to real applications where effective real-time algorithms are needed. Knowledge distillation tackles this issue by letting the smaller model (student) produce pixel-wise predictions similar to those of a larger model (teacher). However, the classifier, which can be viewed as the perspective through which models perceive the encoded features when yielding observations (i.e., predictions), is shared by all training samples and therefore fits a universal feature distribution. Since good generalization to the entire distribution can come at the cost of specificity to individual samples given a certain capacity, the shared universal perspective often overlooks details present in each sample, degrading knowledge distillation. In this paper, we propose Adaptive Perspective Distillation (APD), which creates an adaptive local perspective for each individual training sample. It extracts detailed contextual information from each training sample specifically, mining more details from the teacher and thus achieving better knowledge distillation results for the student. APD places no structural constraints on either the teacher or the student model and therefore generalizes well to different semantic segmentation models. Extensive experiments on Cityscapes, ADE20K, and PASCAL-Context demonstrate the effectiveness of our proposed APD. Moreover, APD yields favorable performance gains for models in both object detection and instance segmentation without bells and whistles.
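One way to realize the per-sample "perspective" described above, sketched below under our own assumptions, is to pool class prototypes from the teacher features of the current sample (using the teacher's own soft prediction) and to align student and teacher after re-scoring both against these prototypes. The exact APD formulation may differ; channel dimensions are assumed equal for brevity.

```python
import torch
import torch.nn.functional as F

def adaptive_perspective_kd(feat_s, feat_t, logits_t, tau=1.0, eps=1e-6):
    """Per-sample prototype-based distillation loss (illustrative sketch).

    feat_s, feat_t: (B, C, H, W) student / teacher features
    logits_t:       (B, K, H, W) teacher segmentation logits
    """
    assign = F.softmax(logits_t / tau, dim=1)                     # soft class assignment
    # Per-sample class prototypes pooled from the teacher: (B, K, C)
    protos = torch.einsum('bkhw,bchw->bkc', assign, feat_t)
    protos = protos / (assign.sum(dim=(2, 3))[..., None] + eps)
    protos = F.normalize(protos, dim=-1)

    def perspective(feat):
        # Cosine scores of each pixel against this sample's prototypes.
        return torch.einsum('bkc,bchw->bkhw', protos, F.normalize(feat, dim=1))

    p_t = F.softmax(perspective(feat_t) / tau, dim=1)
    log_p_s = F.log_softmax(perspective(feat_s) / tau, dim=1)
    return F.kl_div(log_p_s, p_t, reduction='batchmean') * tau * tau
```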

7.
IEEE Trans Pattern Anal Mach Intell ; 45(4): 4552-4568, 2023 Apr.
Article in English | MEDLINE | ID: mdl-35994543

ABSTRACT

In this paper, we present a conceptually simple, strong, and efficient framework for fully- and weakly-supervised panoptic segmentation, called Panoptic FCN. Our approach aims to represent and predict foreground things and background stuff in a unified fully convolutional pipeline, which can be optimized with full supervision or point-based weak supervision. In particular, Panoptic FCN encodes each object instance or stuff category with the proposed kernel generator and produces the prediction by convolving the kernels directly with the high-resolution feature. With this approach, instance-aware and semantically consistent properties for things and stuff, respectively, can be satisfied in a simple generate-kernel-then-segment workflow. Without extra boxes for localization or instance separation, the proposed approach outperforms previous box-based and box-free models with high efficiency. Furthermore, we propose a new form of point-based annotation for weakly-supervised panoptic segmentation. It requires only a few random points for both things and stuff, which dramatically reduces the human annotation cost. The proposed Panoptic FCN also proves markedly superior in this weakly-supervised setting, achieving 82% of the fully-supervised performance with only 20 randomly annotated points per instance. Extensive experiments demonstrate the effectiveness and efficiency of Panoptic FCN on the COCO, VOC 2012, Cityscapes, and Mapillary Vistas datasets, and it sets a new leading benchmark for both fully- and weakly-supervised panoptic segmentation.
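The generate-kernel-then-segment step described above reduces, at inference time, to a dot product between each predicted kernel and the shared high-resolution feature at every position (a 1x1 convolution). A minimal sketch, with shapes assumed for illustration:

```python
import torch

def kernels_to_masks(kernels, feature):
    """Convolve each predicted kernel with the shared high-resolution feature.

    kernels: (N, C) one kernel per predicted thing instance or stuff category
    feature: (C, H, W) shared high-resolution encoded feature
    Returns (N, H, W) mask logits.
    """
    return torch.einsum('nc,chw->nhw', kernels, feature)

# Illustrative usage with assumed shapes:
feature = torch.randn(64, 200, 304)                       # e.g. 1/4-resolution feature map
kernels = torch.randn(12, 64)                             # 12 generated kernels
masks = kernels_to_masks(kernels, feature).sigmoid()      # (12, 200, 304)
```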

8.
IEEE Trans Pattern Anal Mach Intell ; 44(2): 1050-1065, 2022 Feb.
Article in English | MEDLINE | ID: mdl-32750843

ABSTRACT

State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results and hardly work on unseen classes without fine-tuning. Few-shot segmentation is therefore proposed to tackle this problem by learning a model that quickly adapts to new classes with only a few labeled support samples. These frameworks still face the challenge of reduced generalization ability on unseen classes, due to inappropriate use of high-level semantic information from the training classes and spatial inconsistency between query and support targets. To alleviate these issues, we propose the Prior Guided Feature Enrichment Network (PFENet). It consists of two novel designs: (1) a training-free prior mask generation method that not only retains generalization power but also improves model performance, and (2) a Feature Enrichment Module (FEM) that overcomes spatial inconsistency by adaptively enriching query features with support features and prior masks. Extensive experiments on PASCAL-5i and COCO prove that the proposed prior generation method and FEM both improve the baseline significantly. Our PFENet also outperforms state-of-the-art methods by a large margin without loss of efficiency. Surprisingly, our model even generalizes to cases without labeled support samples.
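The training-free prior mask can be sketched directly from the description above: for every query position, take the maximum cosine similarity to any foreground support position in a high-level feature map, then min-max normalize. Details such as the feature layer and normalization follow the paper only loosely; names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def prior_mask(query_feat, support_feat, support_mask, eps=1e-7):
    """Training-free prior generation sketch.

    query_feat:   (C, Hq, Wq) high-level query features
    support_feat: (C, Hs, Ws) high-level support features
    support_mask: (Hs, Ws)    binary foreground mask for the support image
    Returns an (Hq, Wq) prior map in [0, 1].
    """
    C = query_feat.shape[0]
    q = F.normalize(query_feat.reshape(C, -1), dim=0)                      # (C, Nq)
    s = F.normalize((support_feat * support_mask).reshape(C, -1), dim=0)   # (C, Ns)
    sim = q.t() @ s                                    # (Nq, Ns) cosine similarities
    prior = sim.max(dim=1).values                      # best foreground support match per query pixel
    prior = (prior - prior.min()) / (prior.max() - prior.min() + eps)
    return prior.reshape(query_feat.shape[1:])
```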
