Results 1 - 20 of 23
1.
IEEE Trans Pattern Anal Mach Intell ; 46(7): 5149-5156, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38329852

ABSTRACT

One-shot skeleton action recognition, which aims to learn a skeleton action recognition model from a single training sample, has attracted increasing interest due to the challenge of collecting and annotating large-scale skeleton action data. However, most existing studies match skeleton sequences by comparing their feature vectors directly, which neglects the spatial structures and temporal orders of skeleton data. This paper presents a novel one-shot skeleton action recognition technique that handles skeleton action recognition via multi-scale spatial-temporal feature matching. We represent skeleton data at multiple spatial and temporal scales and achieve optimal feature matching from two perspectives. The first is multi-scale matching, which captures the scale-wise semantic relevance of skeleton data at multiple spatial and temporal scales simultaneously. The second is cross-scale matching, which handles different motion magnitudes and speeds by capturing sample-wise relevance across multiple scales. Extensive experiments on three large-scale datasets (NTU RGB+D, NTU RGB+D 120, and PKU-MMD) show that our method achieves superior one-shot skeleton action recognition and consistently outperforms the state of the art by large margins.
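The two matching perspectives can be illustrated with a minimal numpy sketch; the cosine similarity, the plain mean for multi-scale matching, and the max for cross-scale matching are simplifying assumptions here, not the paper's optimal-matching formulation:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two flattened feature maps.
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def multiscale_score(query_feats, support_feats):
    # Scale-wise matching: compare the query and the one-shot support
    # at each spatial-temporal scale, then aggregate (here: a plain mean).
    return float(np.mean([cosine(q, s)
                          for q, s in zip(query_feats, support_feats)]))

def cross_scale_score(query_feats, support_feats):
    # Cross-scale matching: tolerate different motion magnitudes/speeds
    # by letting any query scale match any support scale (here: a max).
    return max(cosine(q, s)
               for q in query_feats for s in support_feats)
```

With identical feature pyramids both scores approach 1.0; a real implementation would learn the aggregation rather than fix it to a mean or max.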

2.
IEEE Trans Pattern Anal Mach Intell ; 45(10): 12192-12205, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37318980

ABSTRACT

In this article, we investigate the problem of panoramic image reflection removal, to relieve the content ambiguity between the reflection layer and the transmission scene. Although a partial view of the reflection scene is attainable in a panoramic image and provides additional information for reflection removal, it is not trivial to exploit it directly, due to its misalignment with the reflection-contaminated image. We propose an end-to-end framework to tackle this problem. By resolving the misalignment with adaptive modules, high-fidelity recovery of the reflection layer and the transmission scene is accomplished. We further propose a new data generation approach that considers the physics-based formation model of mixture images and in-camera dynamic range clipping to diminish the domain gap between synthetic and real data. Experimental results demonstrate the effectiveness of the proposed method and its applicability to mobile devices and industrial applications.

3.
Article in English | MEDLINE | ID: mdl-37027772

ABSTRACT

Learning a generalizable feature representation is critical to few-shot image classification. While recent works exploit task-specific feature embeddings with meta-tasks for few-shot learning, they are limited in many challenging tasks because they are distracted by extraneous features such as the background, domain, and style of the image samples. In this work, we propose a novel disentangled feature representation framework, dubbed DFR, for few-shot learning applications. DFR adaptively decouples the discriminative features, which are modeled by its classification branch, from the class-irrelevant component of its variation branch. In general, most popular deep few-shot learning methods can be plugged in as the classification branch, so DFR can boost their performance on various few-shot tasks. Furthermore, we propose a novel FS-DomainNet dataset, based on DomainNet, for benchmarking few-shot domain generalization (DG). We conducted extensive experiments to evaluate the proposed DFR on general, fine-grained, and cross-domain few-shot classification, as well as few-shot DG, using the corresponding four benchmarks, i.e., mini-ImageNet, tiered-ImageNet, Caltech-UCSD Birds 200-2011 (CUB), and the proposed FS-DomainNet. Thanks to its effective feature disentangling, the DFR-based few-shot classifiers achieved state-of-the-art results on all datasets.

4.
IEEE Trans Pattern Anal Mach Intell ; 45(2): 1424-1441, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35439129

ABSTRACT

Reflection removal has been discussed for decades. This paper provides an analysis of the different reflection properties and the factors that influence image formation, an up-to-date taxonomy of existing methods, a benchmark dataset, and unified benchmarking evaluations for state-of-the-art (especially learning-based) methods. Specifically, this paper presents a SIngle-image Reflection Removal Plus dataset, "SIR^2+", with new consideration of in-the-wild scenarios and glass with diverse colors and non-planar shapes. We further perform quantitative and visual quality comparisons of state-of-the-art single-image reflection removal algorithms. Open problems for improving reflection removal algorithms are discussed at the end. Our dataset and follow-up updates can be found at https://reflectionremoval.github.io/sir2data/.

5.
Article in English | MEDLINE | ID: mdl-35786550

ABSTRACT

In heavy rain video, rain streaks and rain accumulation are the most common causes of degradation. They occlude background information and can significantly impair visibility. Most existing methods rely heavily on synthetic training data and thus suffer from a domain gap that prevents the trained models from performing adequately on real test cases. Unlike these methods, we introduce a self-learning method that removes both rain streaks and rain accumulation without using any ground-truth clean images to train our model, which consequently alleviates the domain gap issue. The main idea is based on the assumptions that (1) adjacent clean frames can be aligned or warped from one frame to another, (2) rain streaks are distributed randomly in the temporal domain, and (3) the rain streak/accumulation related variables/priors can be inferred reliably from the information within the images/sequences. Based on these assumptions, we construct an augmented Self-Learned Deraining Network (SLDNet+) to remove both rain streaks and rain accumulation by utilizing temporal correlation, temporal consistency, and rain-related priors. For temporal correlation, our SLDNet+ takes rain-degraded adjacent frames as input, aligns them, and learns to predict the clean version of the current frame. For temporal consistency, a new loss is designed to build a robust mapping between the predicted clean frame and non-rain regions of the adjacent rain frames. For the rain-streak-related prior, the rain streak removal network is optimized jointly with motion estimation and rain region detection, while for the rain-accumulation-related prior, a novel non-local video rain accumulation removal method is developed to estimate the accumulation-lines from the whole input video and to offer better color constancy and temporal smoothness.
Extensive experiments show the effectiveness of our approach, which provides superior results compared with existing state-of-the-art methods both quantitatively and qualitatively. The source code will be made publicly available at: https://github.com/flyywh/CVPR-2020-Self-Rain-Removal-Journal.

6.
Micromachines (Basel) ; 13(4)2022 Mar 31.
Article in English | MEDLINE | ID: mdl-35457869

ABSTRACT

X-ray imaging machines are widely used at border control checkpoints and in public transportation for luggage scanning and inspection. Recent advances in deep learning have enabled automatic object detection on X-ray imaging results, largely reducing labor costs. Compared to tasks on natural images, object detection for X-ray inspection is typically more challenging, due to the varied sizes and aspect ratios of X-ray images, the random locations of small target objects within a redundant background region, etc. In practice, we show that directly applying off-the-shelf deep learning-based detection algorithms to X-ray imagery can be highly time-consuming and ineffective. To this end, we propose a Task-Driven Cropping scheme, dubbed TDC, for improving deep image detection algorithms towards efficient and effective luggage inspection via X-ray images. Instead of processing whole X-ray images for object detection, we propose a two-stage strategy, which first adaptively crops X-ray images and preserves only the task-related regions, i.e., the luggage regions for security inspection. A task-specific deep feature extractor is used to rapidly identify the importance of each X-ray image pixel. Only the regions that are useful and related to the detection tasks are kept and passed to the follow-up deep detector. The varied-scale X-ray images are thus reduced to the same size and aspect ratio, which enables a more efficient deep detection pipeline. Moreover, to benchmark the effectiveness of X-ray image detection algorithms, we propose a novel dataset for X-ray image detection, dubbed SIXray-D, based on the popular SIXray dataset. In SIXray-D, we provide complete and more accurate annotations of both object classes and bounding boxes, which enables model training for supervised X-ray detection methods. Our results show that the proposed TDC algorithm can effectively boost popular detection algorithms, achieving better detection mAPs or reducing run time.
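The cropping step itself is simple once a per-pixel importance map is available; the sketch below assumes the map already exists (in the paper it comes from a task-specific deep feature extractor) and just takes the tight bounding box of the salient pixels:

```python
import numpy as np

def task_driven_crop(image, importance, thresh=0.5):
    # image: (H, W) or (H, W, C) array; importance: (H, W) map in [0, 1].
    # Returns the tight crop around pixels whose importance >= thresh.
    mask = importance >= thresh
    if not mask.any():
        return image  # nothing salient: keep the full frame
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

After this step, every crop can be resized to a common shape so the downstream detector sees inputs of uniform size and aspect ratio.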

7.
IEEE Trans Pattern Anal Mach Intell ; 44(3): 1289-1303, 2022 Mar.
Article in English | MEDLINE | ID: mdl-32870783

ABSTRACT

Deep learning-based approaches, which have repeatedly been shown to benefit visual recognition tasks, usually make the strong assumption that the training and test data are drawn from similar feature spaces and distributions. However, such an assumption may not hold in various practical visual recognition scenarios. Inspired by the hierarchical organization of deep feature representations, which progressively leads to more abstract features at higher layers, we propose to tackle this problem with a novel feature learning framework, called GMFAD, which has better generalization capability in a multilayer perceptron manner. We first learn feature representations at the shallow layer, where shareable underlying factors among domains (e.g., a subset of which could be relevant for each particular domain) can be explored. In particular, we propose to align the domain divergence between domain pair(s) by considering both inter-dimension and inter-sample correlations, which have been largely ignored by many cross-domain visual recognition methods. Subsequently, to learn more abstract information that could further benefit transferability, we propose to conduct feature disentanglement at the deep feature layer. Extensive experiments on different visual recognition tasks demonstrate that our proposed framework learns better transferable feature representations than state-of-the-art baselines.

8.
IEEE Trans Image Process ; 30: 1596-1607, 2021.
Article in English | MEDLINE | ID: mdl-33382653

ABSTRACT

With the assistance of sophisticated training methods applied to single labeled datasets, the performance of fully-supervised person re-identification (Person Re-ID) has been improved significantly in recent years. However, these models trained on a single dataset usually suffer from considerable performance degradation when applied to videos of a different camera network. To make Person Re-ID systems more practical and scalable, several cross-dataset domain adaptation methods have been proposed, which achieve high performance without the labeled data from the target domain. However, these approaches still require the unlabeled data of the target domain during the training process, making them impractical. A practical Person Re-ID system pre-trained on other datasets should start running immediately after deployment on a new site without having to wait until sufficient images or videos are collected and the pre-trained model is tuned. To serve this purpose, in this paper, we reformulate person re-identification as a multi-dataset domain generalization problem. We propose a multi-dataset feature generalization network (MMFA-AAE), which is capable of learning a universal domain-invariant feature representation from multiple labeled datasets and generalizing it to 'unseen' camera systems. The network is based on an adversarial auto-encoder to learn a generalized domain-invariant latent feature representation with the Maximum Mean Discrepancy (MMD) measure to align the distributions across multiple domains. Extensive experiments demonstrate the effectiveness of the proposed method. Our MMFA-AAE approach not only outperforms most of the domain generalization Person Re-ID methods, but also surpasses many state-of-the-art supervised methods and unsupervised domain adaptation methods by a large margin.
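The MMD measure used to align feature distributions across domains is a standard quantity; a minimal biased estimator with an RBF kernel looks like this (the adversarial auto-encoder around it is omitted):

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    # Biased estimate of squared MMD between samples X (n, d) and Y (m, d)
    # under an RBF kernel k(a, b) = exp(-gamma * ||a - b||^2).
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * d2)
    return float(gram(X, X).mean() + gram(Y, Y).mean()
                 - 2.0 * gram(X, Y).mean())
```

In training, this scalar would be added to the auto-encoder loss so that the per-dataset feature distributions are pulled together.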


Subjects
Biometric Identification/methods, Image Processing, Computer-Assisted/methods, Machine Learning, Algorithms, Humans, Video Recording
9.
IEEE Trans Pattern Anal Mach Intell ; 42(2): 494-501, 2020 Feb.
Article in English | MEDLINE | ID: mdl-30676946

ABSTRACT

In this paper, a feature boosting network is proposed for estimating 3D hand pose and 3D body pose from a single RGB image. In this method, the features learned by the convolutional layers are boosted with a new long short-term dependence-aware (LSTD) module, which enables the intermediate convolutional feature maps to perceive the graphical long short-term dependency among different hand (or body) parts using the designed Graphical ConvLSTM. Learning a set of features that are reliable and discriminatively representative of the pose of a hand (or body) part is difficult due to the ambiguities, texture and illumination variations, and self-occlusions in real applications of 3D pose estimation. To improve the reliability of the features for representing each body part and to enhance the LSTD module, we further introduce a context consistency gate (CCG), with which the convolutional feature maps are modulated according to their consistency with the context representations. We evaluate the proposed method on challenging benchmark datasets for 3D hand pose estimation and 3D full body pose estimation. Experimental results show the effectiveness of our method, which achieves state-of-the-art performance on both tasks.


Subjects
Hand/diagnostic imaging, Imaging, Three-Dimensional/methods, Machine Learning, Posture/physiology, Humans, Reproducibility of Results
10.
IEEE Trans Pattern Anal Mach Intell ; 42(6): 1453-1467, 2020 Jun.
Article in English | MEDLINE | ID: mdl-30762531

ABSTRACT

Action prediction aims to recognize the class label of an ongoing activity when only a part of it has been observed. In this paper, we focus on online action prediction in streaming 3D skeleton sequences. A dilated convolutional network is introduced to model the motion dynamics in the temporal dimension via a sliding window over the temporal axis. Since there are significant temporal scale variations in the observed part of the ongoing action at different time steps, a novel window scale selection method is proposed to make our network focus on the performed part of the ongoing action and suppress possible incoming interference from previous actions at each step. An activation sharing scheme is also proposed to handle the overlapping computations among adjacent time steps, which enables our framework to run more efficiently. Moreover, to enhance the performance of our framework for action prediction with skeletal input data, a hierarchy of dilated tree convolutions is also designed to learn multi-level structured semantic representations over the skeleton joints at each frame. Our proposed approach is evaluated on four challenging datasets. Extensive experiments demonstrate the effectiveness of our method for skeleton-based online action prediction.
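The building block behind such a sliding-window temporal model is a dilated causal convolution; a minimal 1D numpy version (illustrative, not the authors' network) is:

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation=1):
    # Causal 1D convolution: output[t] depends only on x[<= t], with the
    # receptive field widened by `dilation` (left zero-padding keeps length).
    K = len(w)
    pad = dilation * (K - 1)
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([sum(w[k] * xp[t + k * dilation] for k in range(K))
                     for t in range(len(x))])
```

Stacking such layers with growing dilation gives the temporal window an exponentially large receptive field at linear cost, which suits the online (streaming) setting.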

11.
IEEE Trans Pattern Anal Mach Intell ; 42(10): 2684-2701, 2020 Oct.
Article in English | MEDLINE | ID: mdl-31095476

ABSTRACT

Research on depth-based human activity analysis has achieved outstanding performance and demonstrated the effectiveness of 3D representations for action recognition. Existing depth-based and RGB+D-based action recognition benchmarks have a number of limitations, including the lack of large-scale training samples, of a realistic number of distinct class categories, of diversity in camera views, of varied environmental conditions, and of variety in human subjects. In this work, we introduce a large-scale dataset for RGB+D human action recognition, collected from 106 distinct subjects and containing more than 114 thousand video samples and 8 million frames. The dataset contains 120 different action classes, including daily, mutual, and health-related activities. We evaluate the performance of a series of existing 3D activity analysis methods on this dataset and show the advantage of applying deep learning methods for 3D-based human action recognition. Furthermore, we investigate a novel one-shot 3D activity recognition problem on our dataset, and a simple yet effective Action-Part Semantic Relevance-aware (APSR) framework is proposed for this task, which yields promising results for recognizing novel action classes. We believe the introduction of this large-scale dataset will enable the community to apply, adapt, and develop various data-hungry learning techniques for depth-based and RGB+D-based human activity understanding.


Subjects
Deep Learning, Human Activities/classification, Image Processing, Computer-Assisted/methods, Pattern Recognition, Automated/methods, Algorithms, Benchmarking, Humans, Semantics, Video Recording
12.
IEEE Trans Pattern Anal Mach Intell ; 42(12): 2969-2982, 2020 Dec.
Article in English | MEDLINE | ID: mdl-31180841

ABSTRACT

Removing undesired reflections from images taken through glass is of broad use to various computer vision tasks. Non-learning-based methods utilize different handcrafted priors, such as the separable sparse gradients caused by different levels of blur, which often fail due to their limited capability to describe the properties of real-world reflections. In this paper, we propose a network with a feature-sharing strategy to tackle this problem in a cooperative and unified framework, by integrating image context information and multi-scale gradient information. To remove the strong reflections that exist in some local regions, we propose a statistic loss that considers the gradient-level statistics between the background and the reflections. Our network is trained on a new dataset with 3250 reflection images taken under diverse real-world scenes. Experiments on a public benchmark dataset show that the proposed method performs favorably against state-of-the-art methods.

13.
IEEE Trans Neural Netw Learn Syst ; 31(3): 984-996, 2020 Mar.
Article in English | MEDLINE | ID: mdl-31150348

ABSTRACT

Heterogeneous domain adaptation (HDA) aims to solve learning problems in which the source- and target-domain data are represented by heterogeneous types of features. Existing HDA approaches based on matrix completion or matrix factorization have proven effective at capturing shareable information between heterogeneous domains. However, there are two limitations in existing methods. First, a large number of corresponding data instances between the source domain and the target domain are required to bridge the gap between different domains when performing matrix completion. These corresponding data instances may be difficult to collect in real-world applications due to the limited size of data in the target domain. Second, most existing methods can only capture linear correlations between features and data instances while performing matrix completion for HDA. In this paper, we address these two issues by proposing a new matrix-factorization-based HDA method in a semisupervised manner, where only a few labeled data are required in the target domain and no corresponding data instances between domains are needed. Such labeled data are more practical to obtain than cross-domain corresponding data instances. Our proposed algorithm is based on matrix factorization in an approximated reproducing kernel Hilbert space (RKHS), where nonlinear correlations between features and data instances can be exploited to learn heterogeneous features for both the source and the target domains. Extensive experiments are conducted on cross-domain text classification and object recognition, and the results demonstrate the superiority of our proposed method over state-of-the-art HDA approaches.
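The "approximated RKHS" idea can be made concrete with random Fourier features (Rahimi-Recht): map the data into an explicit feature space whose inner products approximate an RBF kernel, after which ordinary matrix factorization can capture nonlinear correlations. The bandwidth and dimension below are illustrative:

```python
import numpy as np

def random_fourier_features(X, D=1000, gamma=1.0, seed=0):
    # Explicit features phi(X) of shape (n, D) with phi(x) @ phi(y)
    # approximating the RBF kernel exp(-gamma * ||x - y||^2).
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

Factorizing phi(X) instead of X then plays the role of matrix factorization in the approximated RKHS.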

14.
Article in English | MEDLINE | ID: mdl-31562087

ABSTRACT

Recent advances in hardware technology have made intelligent analysis with deep learning at the front end more prevalent and practical. To better enable intelligent sensing at the front end, instead of compressing and transmitting visual signals or the ultimately utilized top-layer deep learning features, we propose to compactly represent and convey the intermediate-layer deep learning features, which have high generalization capability, to facilitate collaboration between the front end and the cloud end. This strategy enables a good balance among the computational load, the transmission load, and the generalization ability for cloud servers when deploying deep neural networks for large-scale cloud-based visual analysis. Moreover, the presented strategy also makes the standardization of deep feature coding more feasible and promising, as a series of tasks can simultaneously benefit from the transmitted intermediate-layer features. We also present evaluation results for both lossless and lossy deep feature compression, which provide meaningful investigations and baselines for future research and standardization activities.
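As a toy illustration of lossy feature coding, a uniform scalar quantizer bounds the reconstruction error of an intermediate feature map by half a quantization step; this is a generic stand-in, not the coding scheme evaluated above:

```python
import numpy as np

def quantize_features(feats, bits=8):
    # Min-max uniform quantization of a feature map to 2^bits levels.
    # Returns the integer codes, the dequantized features, and the
    # (lo, hi) range a decoder would need to reconstruct them.
    lo, hi = float(feats.min()), float(feats.max())
    levels = (1 << bits) - 1
    q = np.round((feats - lo) / (hi - lo + 1e-12) * levels).astype(np.uint16)
    deq = q.astype(np.float64) / levels * (hi - lo) + lo
    return q, deq, (lo, hi)
```

The integer codes (plus the two range floats) are what would actually be entropy-coded and transmitted to the cloud end.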

15.
IEEE Trans Pattern Anal Mach Intell ; 40(12): 3007-3021, 2018 Dec.
Article in English | MEDLINE | ID: mdl-29990167

ABSTRACT

Skeleton-based human action recognition has attracted a lot of research attention during the past few years. Recent works attempted to utilize recurrent neural networks to model the temporal dependencies between the 3D positional configurations of human body joints for better analysis of human activities in skeletal data. The proposed work extends this idea to the spatial domain as well as the temporal domain, to better analyze the hidden sources of action-related information within human skeleton sequences in both domains simultaneously. Based on the pictorial structure of Kinect's skeletal data, an effective tree-structure-based traversal framework is also proposed. In order to deal with noise in the skeletal data, a new gating mechanism within the LSTM module is introduced, with which the network can learn the reliability of the sequential data and accordingly adjust the effect of the input data on the updating procedure of the long-term context representation stored in the unit's memory cell. Moreover, we introduce a novel multi-modal feature fusion strategy within the LSTM unit. Comprehensive experimental results on seven challenging benchmark datasets for human action recognition demonstrate the effectiveness of the proposed method.

16.
Article in English | MEDLINE | ID: mdl-29994443

ABSTRACT

Removing undesired reflections from images taken through glass is of broad use to various image processing and computer vision tasks. Existing single-image-based solutions rely heavily on scene priors, such as the separable sparse gradients caused by different levels of blur, and they are fragile when such priors are not observed. In this paper, we note that strong reflections usually dominate a limited region of the whole image, and we propose a Region-aware Reflection Removal (R3) approach that automatically detects and heterogeneously processes regions with and without reflections. We integrate content and gradient priors to jointly achieve missing-content restoration as well as background and reflection separation in a unified optimization framework. Extensive validation on 50 sets of real data shows that the proposed method outperforms the state of the art on both quantitative metrics and visual quality.

17.
IEEE Trans Image Process ; 27(5): 2201-2216, 2018 May.
Article in English | MEDLINE | ID: mdl-29432101

ABSTRACT

The Compact Descriptors for Visual Search (CDVS) standard from the ISO/IEC Moving Picture Experts Group has succeeded in enabling interoperability for efficient and effective image retrieval by standardizing the bitstream syntax of compact feature descriptors. However, the intensive computation of a CDVS encoder unfortunately hinders its wide deployment in industry for large-scale visual search. In this paper, we revisit the merits of the low-complexity design of the CDVS core techniques and present a very fast CDVS encoder that leverages the massive parallel execution resources of the graphics processing unit (GPU). We shift the computation-intensive and parallel-friendly modules to state-of-the-art GPU platforms, on which the thread block allocation and the memory access mechanism are jointly optimized to eliminate performance loss. In addition, operations with heavy data dependence are allocated to the CPU, relieving the GPU of an unnecessary computation burden. Furthermore, we demonstrate that the proposed fast CDVS encoder works well with convolutional neural network approaches, which harmoniously leverage the advantages of GPU platforms and yield significant performance improvements. Comprehensive experimental results over benchmarks show that the fast CDVS encoder using GPU-CPU hybrid computing is promising for scalable visual search.

18.
IEEE Trans Image Process ; 27(4): 1586-1599, 2018 Apr.
Article in English | MEDLINE | ID: mdl-29324413

ABSTRACT

Human action recognition in 3D skeleton sequences has attracted a lot of research attention. Recently, long short-term memory (LSTM) networks have shown promising performance on this task due to their strengths in modeling the dependencies and dynamics in sequential data. Since not all skeletal joints are informative for action recognition, and irrelevant joints often bring noise that can degrade performance, we need to pay more attention to the informative ones. However, the original LSTM network does not have explicit attention ability. In this paper, we propose a new class of LSTM network, the global context-aware attention LSTM, for skeleton-based action recognition, which is capable of selectively focusing on the informative joints in each frame by using a global context memory cell. To further improve the attention capability, we also introduce a recurrent attention mechanism, with which the attention performance of our network can be enhanced progressively. In addition, a two-stream framework that leverages coarse-grained attention and fine-grained attention is introduced. The proposed method achieves state-of-the-art performance on five challenging datasets for skeleton-based action recognition.
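The core of global context-aware attention, softmax weights over the joints of a frame driven by a global context vector, can be sketched as follows (the feature shapes and the plain dot-product score are simplifying assumptions):

```python
import numpy as np

def joint_attention(joint_feats, global_context):
    # joint_feats: (J, d) per-joint features of one frame.
    # global_context: (d,) global context memory for the whole sequence.
    scores = joint_feats @ global_context            # relevance per joint
    scores -= scores.max()                           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over joints
    return weights @ joint_feats, weights            # attended feature, weights
```

In the recurrent attention mechanism described above, the attended feature would be fed back to refine the global context over several iterations.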


Subjects
Human Activities/classification, Neural Networks, Computer, Pattern Recognition, Automated/methods, Algorithms, Databases, Factual, Humans, Machine Learning, Memory, Short-Term, Models, Neurological
19.
IEEE Trans Image Process ; 26(12): 5867-5881, 2017 Dec.
Article in English | MEDLINE | ID: mdl-28792895

ABSTRACT

Cross-domain shoe image retrieval is a challenging problem, because the query photo from the street domain (daily life scenarios) and the reference photo from the online domain (online shop images) have significant visual differences due to viewpoint and scale variation, self-occlusion, and cluttered backgrounds. This paper proposes the semantic hierarchy of attribute convolutional neural network (SHOE-CNN), with a three-level feature representation, for discriminative shoe feature expression and efficient retrieval. The SHOE-CNN, with its newly designed loss function, systematically merges semantic attributes of closer visual appearance to prevent shoe images with obvious visual differences from being confused with each other; the features extracted at the image, region, and part levels effectively match shoe images across different domains. We collected a large-scale shoe dataset composed of 14,341 street-domain and 12,652 corresponding online-domain images with fine-grained attributes to train our network and evaluate our system. The top-20 retrieval accuracy improves significantly over the solution based on pre-trained CNN features.

20.
IEEE Trans Image Process ; 25(8): 3775-3786, 2016 Aug.
Article in English | MEDLINE | ID: mdl-27295675

ABSTRACT

Distortions cause structural changes in digital images, leading to degraded visual quality. Dictionary-based sparse representation has been widely studied recently due to its ability to extract inherent image structures. Meanwhile, it can extract image features with slightly higher-level semantics. Intuitively, sparse representation can be used for image quality assessment, because visible distortions cause significant changes to the sparse features. In this paper, a new sparse-representation-based image quality assessment model is proposed, based on the construction of adaptive sub-dictionaries. An overcomplete dictionary trained on natural images is employed to capture the structural changes between the reference and distorted images through sparse feature extraction via adaptive sub-dictionary selection. Based on the observations that image sparse features are invariant to weak degradations and that perceived image quality is generally influenced by diverse issues, three auxiliary quality features are added: gradient, color, and luminance information. The proposed method is not sensitive to the training images, so a universal dictionary can be adopted for quality evaluation. Extensive experiments on five public image quality databases demonstrate that the proposed method produces state-of-the-art results and delivers consistently good performance across different image quality databases.
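Sparse feature extraction over a dictionary is typically done with a pursuit algorithm; a compact Orthogonal Matching Pursuit, which could serve as the sparse coder in such a pipeline (the adaptive sub-dictionary selection is not shown), is:

```python
import numpy as np

def omp(D, x, n_nonzero=3):
    # Orthogonal Matching Pursuit: greedy sparse code of x over a
    # dictionary D (d x K, unit-norm columns), with at most n_nonzero atoms.
    residual, idx = x.astype(float).copy(), []
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        sub = D[:, idx]
        # Re-fit all selected atoms jointly (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(sub, x, rcond=None)
        residual = x - sub @ coef
    code = np.zeros(D.shape[1])
    code[idx] = coef
    return code
```

Comparing the sparse codes of co-located reference and distorted patches then gives the structural-change signal the model builds on.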
