Results 1 - 20 of 51
1.
Article in English | MEDLINE | ID: mdl-38748521

ABSTRACT

Vision Transformers have recently been the most popular network architecture in visual recognition due to their strong ability to encode global information. However, their high computational cost when processing high-resolution images limits their application to downstream tasks. In this paper, we take a deep look at the internal structure of self-attention and present a simple Transformer-style convolutional neural network (ConvNet) for visual recognition. By comparing the design principles of recent ConvNets and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation. We show that such a simple approach can better take advantage of the large kernels (≥ 7×7) nested in convolutional layers, and we observe a consistent performance improvement when gradually increasing the kernel size from 5×5 to 21×21. We build a family of hierarchical ConvNets using the proposed convolutional modulation, termed Conv2Former. Our network is simple and easy to follow. Experiments show that our Conv2Former outperforms popular existing ConvNets and Vision Transformers, such as Swin Transformer and ConvNeXt, on ImageNet classification, COCO object detection, and ADE20K semantic segmentation. Our code is available at https://github.com/HVision-NKU/Conv2Former.
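For readers who want a concrete picture of the convolutional modulation described above, here is a minimal PyTorch sketch: a large-kernel depthwise convolution branch produces a modulation map that gates a linear value branch, replacing the attention matrix. The layer layout, GELU placement, and default kernel size are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ConvModulation(nn.Module):
    """Sketch of a convolutional-modulation block in the spirit of Conv2Former."""

    def __init__(self, dim: int, kernel_size: int = 11):
        super().__init__()
        # Modulation branch: 1x1 conv + GELU + large-kernel depthwise conv.
        self.a = nn.Sequential(
            nn.Conv2d(dim, dim, 1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
        )
        self.v = nn.Conv2d(dim, dim, 1)     # value branch (linear projection)
        self.proj = nn.Conv2d(dim, dim, 1)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Hadamard product of the modulation map and the values, then project.
        return self.proj(self.a(x) * self.v(x))

block = ConvModulation(dim=64, kernel_size=11)
out = block(torch.randn(1, 64, 56, 56))   # -> torch.Size([1, 64, 56, 56])
```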

2.
IEEE Trans Pattern Anal Mach Intell; 46(4): 2506-2517, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38015699

ABSTRACT

Masked image modeling (MIM) has achieved promising results on various vision tasks. However, the limited discriminability of the learned representations suggests there is still considerable room for building a stronger vision learner. Toward this goal, we propose Contrastive Masked Autoencoders (CMAE), a new self-supervised pre-training method for learning more comprehensive and capable vision representations. By carefully unifying contrastive learning (CL) and MIM through novel designs, CMAE leverages their respective advantages and learns representations with both strong instance discriminability and local perceptibility. Specifically, CMAE consists of two branches: the online branch is an asymmetric encoder-decoder, and the momentum branch is a momentum-updated encoder. During training, the online encoder reconstructs original images from latent representations of masked images to learn holistic features. The momentum encoder, fed with the full images, enhances feature discriminability via contrastive learning with its online counterpart. To make CL compatible with MIM, CMAE introduces two new components: pixel shifting for generating plausible positive views and a feature decoder for complementing the features of contrastive pairs. Thanks to these novel designs, CMAE effectively improves representation quality and transfer performance over its MIM counterpart. CMAE achieves state-of-the-art performance on highly competitive benchmarks for image classification, semantic segmentation, and object detection. Notably, CMAE-Base achieves 85.3% top-1 accuracy on ImageNet and 52.5% mIoU on ADE20K, surpassing the previous best results by 0.7% and 1.8%, respectively.
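As a rough illustration of the pixel-shifting idea mentioned above, the sketch below crops two spatially overlapping views of one image whose origins differ by a small random offset; the first would be masked and fed to the online encoder, the second passed whole to the momentum encoder. The crop size and shift range are assumptions, not CMAE's exact settings.

```python
import torch

def pixel_shifted_views(img: torch.Tensor, size: int = 224, max_shift: int = 31):
    """Crop two overlapping views of `img` (C, H, W) offset by a small random shift."""
    _, H, W = img.shape
    assert H >= size + max_shift and W >= size + max_shift, "image must be larger than crop + shift"
    y = torch.randint(0, H - size - max_shift + 1, (1,)).item()
    x = torch.randint(0, W - size - max_shift + 1, (1,)).item()
    dy = torch.randint(0, max_shift + 1, (1,)).item()
    dx = torch.randint(0, max_shift + 1, (1,)).item()
    online_view = img[:, y:y + size, x:x + size]                         # to be masked, online branch
    momentum_view = img[:, y + dy:y + dy + size, x + dx:x + dx + size]   # full view, momentum branch
    return online_view, momentum_view
```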

3.
Article in English | MEDLINE | ID: mdl-38090871

ABSTRACT

Data-dependent hashing methods aim to learn hash functions from pairwise or triplet relationships among the data, which often leads to low efficiency and a low collision rate because only the local distribution of the data is captured. To address this limitation, we propose central similarity, in which the hash codes of similar data pairs are encouraged to approach a common center and those of dissimilar pairs to converge to different centers. As a new global similarity metric, central similarity can improve the efficiency and retrieval accuracy of hash learning. By introducing a new concept, hash centers, we give a principled formulation of the proposed central similarity metric, in which the hash centers refer to a set of points scattered in the Hamming space with sufficient mutual distance from one another. To construct well-separated hash centers, we provide two efficient methods: 1) leveraging the Hadamard matrix and Bernoulli distributions to generate data-independent hash centers, and 2) learning data-dependent hash centers from data representations. Based on the proposed similarity metric and hash centers, we propose central similarity quantization (CSQ), which optimizes the central similarity between data points with respect to their hash centers, instead of optimizing local similarity, to generate a high-quality deep hash function. We further improve CSQ with data-dependent hash centers, dubbed CSQ with learnable centers (CSQ [Formula: see text]). The proposed CSQ and CSQ [Formula: see text] are generic and applicable to both image and video hashing scenarios. We conduct extensive experiments on large-scale image and video retrieval tasks, and the proposed CSQ yields noticeably boosted retrieval performance, i.e., 3%-20% in mean average precision (mAP) over the previous state-of-the-art methods, which also demonstrates that our methods can generate cohesive hash codes for similar data pairs and dispersed hash codes for dissimilar pairs.
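The Hadamard-based construction of data-independent hash centers can be sketched as follows: rows of a Hadamard matrix (and their negations) are mutually far apart in Hamming distance, with Bernoulli sampling as a fallback when more centers are needed. This is a hedged illustration using SciPy, not the authors' released code; the exact sampling details are assumptions.

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_hash_centers(num_centers: int, bits: int) -> np.ndarray:
    """Return `num_centers` binary hash centers of length `bits` with large mutual Hamming distance."""
    assert bits & (bits - 1) == 0, "Hadamard construction assumes bits is a power of two"
    H = hadamard(bits)                               # entries in {-1, +1}
    centers = np.vstack([H, -H])[:num_centers]       # rows and their negations
    if len(centers) < num_centers:                   # fallback: Bernoulli(0.5) random centers
        extra = np.random.choice([-1, 1], size=(num_centers - len(centers), bits))
        centers = np.vstack([centers, extra])
    return (centers > 0).astype(np.int8)             # map {-1, +1} -> {0, 1}

centers = hadamard_hash_centers(num_centers=100, bits=64)   # e.g., 100 classes, 64-bit codes
```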

4.
Article in English | MEDLINE | ID: mdl-37910405

ABSTRACT

MetaFormer, the abstracted architecture of the Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again by shifting our focus away from token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers and demonstrate their gratifying performance. We summarize our observations as follows: (1) MetaFormer ensures a solid lower bound on performance. By merely adopting identity mapping as the token mixer, the MetaFormer model, termed IdentityFormer, achieves over 80% accuracy on ImageNet-1K. (2) MetaFormer works well with arbitrary token mixers. Even when the token mixer is specified as a random matrix to mix tokens, the resulting model, RandFormer, yields an accuracy of over 81%, outperforming IdentityFormer. One can thus rest assured of MetaFormer's results when new token mixers are adopted. (3) MetaFormer effortlessly offers state-of-the-art results. With just conventional token mixers dating back five years, the models instantiated from MetaFormer already beat the state of the art. (a) ConvFormer outperforms ConvNeXt. Taking common depthwise separable convolutions as the token mixer, the model, termed ConvFormer, which can be regarded as a pure CNN, outperforms the strong CNN model ConvNeXt. (b) CAFormer sets a new record on ImageNet-1K. By simply applying depthwise separable convolutions as the token mixer in the bottom stages and vanilla self-attention in the top stages, the resulting model, CAFormer, sets a new record on ImageNet-1K: it achieves an accuracy of 85.5% at 224×224 resolution under normal supervised training, without external data or distillation. In our expedition to probe MetaFormer, we also find that a new activation, StarReLU, reduces activation FLOPs by 71% compared with the commonly used GELU, yet achieves better performance. Specifically, StarReLU is a variant of Squared ReLU dedicated to alleviating distribution shift. We expect StarReLU to have great potential in MetaFormer-like models and other neural networks. Code and models are available at https://github.com/sail-sg/metaformer.
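Since the abstract describes StarReLU only verbally, here is a minimal sketch consistent with that description: a Squared ReLU rescaled and shifted by learnable scalars. The initial values of the scale and bias below are placeholders, not the paper's derived constants.

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """Sketch of StarReLU: s * relu(x)**2 + b, with learnable scalars s and b."""

    def __init__(self, scale: float = 1.0, bias: float = 0.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(scale))  # learnable scale s
        self.bias = nn.Parameter(torch.tensor(bias))    # learnable bias b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * torch.relu(x) ** 2 + self.bias
```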

6.
IEEE Trans Pattern Anal Mach Intell; 45(8): 10012-10026, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37027609

ABSTRACT

Existing 3D human pose estimation methods often generalize poorly to new datasets, largely due to the limited diversity of 2D-3D pose pairs in the training data. To address this problem, we present PoseAug, a novel auto-augmentation framework that learns to augment the available training poses toward greater diversity and thus enhances the generalization power of the trained 2D-to-3D pose estimator. Specifically, PoseAug introduces a novel pose augmentor that learns to adjust various geometry factors of a pose through differentiable operations. With this differentiable capacity, the augmentor can be jointly optimized with the 3D pose estimator and take the estimation error as feedback to generate more diverse and harder poses in an online manner. PoseAug is generic and easy to apply to various 3D pose estimation models. It is also extendable to aid pose estimation from video frames. To demonstrate this, we introduce PoseAug-V, a simple yet effective method that decomposes video pose augmentation into end-pose augmentation and conditioned intermediate pose generation. Extensive experiments demonstrate that PoseAug and its extension PoseAug-V bring clear improvements for frame-based and video-based 3D pose estimation on several out-of-domain 3D human pose benchmarks.


Subjects
Algorithms; Imaging, Three-Dimensional; Humans; Imaging, Three-Dimensional/methods
7.
IEEE Trans Pattern Anal Mach Intell; 45(9): 10795-10816, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37074896

ABSTRACT

Deep long-tailed learning, one of the most challenging problems in visual recognition, aims to train well-performing deep models from a large number of images that follow a long-tailed class distribution. In the last decade, deep learning has emerged as a powerful recognition model for learning high-quality image representations and has led to remarkable breakthroughs in generic visual recognition. However, long-tailed class imbalance, a common problem in practical visual recognition tasks, often limits the practicality of deep network-based recognition models in real-world applications, since they can easily be biased towards dominant classes and perform poorly on tail classes. To address this problem, a large number of studies have been conducted in recent years, making promising progress in the field of deep long-tailed learning. Considering the rapid evolution of this field, this article aims to provide a comprehensive survey of recent advances in deep long-tailed learning. To be specific, we group existing deep long-tailed learning studies into three main categories (i.e., class re-balancing, information augmentation, and module improvement) and review these methods following this taxonomy in detail. Afterward, we empirically analyze several state-of-the-art methods by evaluating to what extent they address the issue of class imbalance via a newly proposed evaluation metric, i.e., relative accuracy. We conclude the survey by highlighting important applications of deep long-tailed learning and identifying several promising directions for future research.
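To make the "class re-balancing" category concrete, the snippet below shows one representative technique from that family, class-balanced re-sampling with PyTorch's WeightedRandomSampler; it is an illustration of a surveyed idea, not a method proposed by the survey itself.

```python
import torch
from torch.utils.data import WeightedRandomSampler

def class_balanced_sampler(labels):
    """Sample training examples so that every class is drawn with roughly equal probability."""
    labels = torch.as_tensor(labels)
    counts = torch.bincount(labels).float()
    weights = 1.0 / counts[labels]          # rare (tail) classes get larger sampling weight
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```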

8.
IEEE Trans Pattern Anal Mach Intell; 45(1): 1328-1334, 2023 Jan.
Article in English | MEDLINE | ID: mdl-35077359

ABSTRACT

In this paper, we present Vision Permutator, a conceptually simple and data-efficient MLP-like architecture for visual recognition. Recognizing the importance of the positional information carried by 2D feature representations, Vision Permutator, unlike recent MLP-like models that encode spatial information along the flattened spatial dimensions, separately encodes the feature representations along the height and width dimensions with linear projections. This allows Vision Permutator to capture long-range dependencies while avoiding the attention-building process of transformers. The outputs are then aggregated in a mutually complementary manner to form expressive representations. We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers. Without relying on spatial convolutions or attention mechanisms, Vision Permutator achieves 81.5% top-1 accuracy on ImageNet without extra large-scale training data (e.g., ImageNet-22K) using only 25M learnable parameters, which is much better than most CNNs and vision transformers under the same model-size constraint. When scaled up to 88M parameters, it attains 83.2% top-1 accuracy, greatly improving the performance of recent state-of-the-art MLP-like networks for visual recognition. We hope this work will encourage research that rethinks the way of encoding spatial information and facilitate the development of MLP-like models. PyTorch/MindSpore/Jittor code is available at https://github.com/Andrew-Qibin/VisionPermutator.
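A much-simplified sketch of the core idea, encoding features separately along the height and width dimensions with linear projections and fusing the branches, is given below. The real Permute-MLP additionally splits channels into segments and uses learned re-weighting for fusion, so treat this as an assumption-laden illustration rather than the released architecture.

```python
import torch
import torch.nn as nn

class SimplePermuteMLP(nn.Module):
    """Mix information along H, W, and channel axes with separate linear projections."""

    def __init__(self, dim: int, height: int, width: int):
        super().__init__()
        self.mlp_h = nn.Linear(height, height)   # mixes along the height axis
        self.mlp_w = nn.Linear(width, width)     # mixes along the width axis
        self.mlp_c = nn.Linear(dim, dim)         # mixes along channels
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C)
        h = self.mlp_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # act on the H dimension
        w = self.mlp_w(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)  # act on the W dimension
        c = self.mlp_c(x)
        return self.proj(h + w + c)                                # simple sum fusion
```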

9.
IEEE Trans Pattern Anal Mach Intell; 45(11): 12738-12746, 2023 Nov.
Article in English | MEDLINE | ID: mdl-36155475

ABSTRACT

Vision transformers (ViTs) have recently attained state-of-the-art results in visual recognition tasks. Their success is largely attributed to the self-attention component, which models the global dependencies among image patches (tokens) and aggregates them into higher-level features. However, self-attention brings significant training difficulties to ViTs. Many recent works thus develop various new self-attention components to alleviate this issue. In this article, instead of developing complicated self-attention mechanisms, we aim to explore simple approaches to fully release the potential of vanilla self-attention. We first study the token selection behavior of self-attention and find that it suffers from low diversity due to attention over-smoothing, which severely limits its effectiveness in learning discriminative token features. We then develop simple approaches to enhance the selectivity and diversity of self-attention in token selection. The resulting token selector module can serve as a drop-in module for various ViT backbones and consistently boost their performance. Notably, it enables ViTs to achieve 84.6% top-1 classification accuracy on ImageNet with only 25M parameters. When scaled up to 81M parameters, the result can be further improved to 86.1%. In addition, we present comprehensive experiments to demonstrate that the token selector can be applied to a variety of transformer-based models to boost their performance on image classification, semantic segmentation, and NLP tasks. Code is available at https://github.com/zhoudaquan/dvit_repo.

10.
IEEE Trans Pattern Anal Mach Intell; 45(5): 6575-6586, 2023 May.
Article in English | MEDLINE | ID: mdl-36094970

ABSTRACT

Recently, Vision Transformers (ViTs) have been broadly explored in visual recognition. Owing to their low efficiency in encoding fine-level features, ViTs still perform worse than state-of-the-art CNNs when trained from scratch on a midsize dataset such as ImageNet. Through experimental analysis, we attribute this to two reasons: 1) the simple tokenization of input images fails to model important local structure such as edges and lines, leading to low training sample efficiency; 2) the redundant attention backbone design of ViTs leads to limited feature richness under fixed computation budgets and limited training samples. To overcome these limitations, we present a new simple and generic architecture, termed Vision Outlooker (VOLO), which implements a novel outlook attention operation that dynamically conducts local feature aggregation in a sliding-window manner across the input image. Unlike self-attention, which focuses on modeling global dependencies of local features at a coarse level, our outlook attention targets encoding finer-level features, which are critical for recognition but ignored by self-attention. Outlook attention breaks the bottleneck of self-attention, whose computation cost scales quadratically with the input spatial dimension, and is thus much more memory efficient. Compared to our Tokens-To-Token Vision Transformer (T2T-ViT), VOLO can more efficiently encode fine-level features that are essential for high-performance visual recognition. Experiments show that with only 26.6M learnable parameters, VOLO achieves 84.2% top-1 accuracy on ImageNet-1K without using extra training data, 2.7% better than T2T-ViT with a comparable number of parameters. When the model size is scaled up to 296M parameters, its performance can be further improved to 87.1%, setting a new record for ImageNet-1K classification. In addition, we take the proposed VOLO as a pretrained model and report superior performance on downstream tasks such as semantic segmentation. Code is available at https://github.com/sail-sg/volo.
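The following is a simplified, single-head sketch of outlook attention: a linear layer generates the attention weights over each location's local window directly from the center feature (no query-key products), and the unfolded values are aggregated with those weights. Stride handling, multi-head splitting, and the fold-based normalization of the released VOLO code are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleOutlookAttention(nn.Module):
    """Single-head, stride-1 sketch of outlook attention over a local k x k window."""

    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.k = k
        self.v = nn.Linear(dim, dim)        # value projection
        self.attn = nn.Linear(dim, k * k)   # weights over the local k x k window
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C)
        B, H, W, C = x.shape
        v = self.v(x).permute(0, 3, 1, 2)                        # (B, C, H, W)
        v = F.unfold(v, self.k, padding=self.k // 2)             # (B, C*k*k, H*W)
        v = v.reshape(B, C, self.k * self.k, H * W)
        a = self.attn(x).reshape(B, H * W, self.k * self.k)
        a = torch.softmax(a, dim=-1)                             # attention weights per location
        out = torch.einsum("bckn,bnk->bcn", v, a).reshape(B, C, H, W)
        return self.proj(out.permute(0, 2, 3, 1))                # back to (B, H, W, C)
```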

11.
Nat Neurosci; 25(6): 795-804, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35578132

ABSTRACT

We propose meta-matching, a simple framework for translating predictive models from large-scale datasets to new, unseen non-brain-imaging phenotypes in small-scale studies. The key consideration is that a unique phenotype from a boutique study likely correlates with (but is not the same as) related phenotypes in some large-scale dataset. Meta-matching exploits these correlations to boost prediction in the boutique study. We apply meta-matching to predict non-brain-imaging phenotypes from resting-state functional connectivity. Using the UK Biobank (N = 36,848) and Human Connectome Project (HCP) (N = 1,019) datasets, we demonstrate that meta-matching can greatly boost the prediction of new phenotypes in small independent datasets in many scenarios. For example, translating a UK Biobank model to 100 HCP participants yields an eight-fold improvement in variance explained with an average absolute gain of 4.0% (minimum = -0.2%, maximum = 16.0%) across 35 phenotypes. With a growing number of large-scale datasets collecting increasingly diverse phenotypes, our results represent a lower bound on the potential of meta-matching.
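A hedged sketch of the simplest form of meta-matching is shown below: a model trained on the large dataset predicts many source phenotypes; on the small study, the source phenotype whose prediction correlates best with the new phenotype is selected and recalibrated with a linear fit. The `big_model.predict` interface returning an (N, K) prediction matrix is an assumption, and the authors' framework also includes stacking and fine-tuning variants not shown here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def basic_meta_matching(big_model, X_small, y_small, X_test):
    """Predict a new phenotype by matching it to the best-correlated source phenotype."""
    P_small = big_model.predict(X_small)   # (N_small, K) predictions of K source phenotypes
    P_test = big_model.predict(X_test)     # (N_test, K)
    corrs = [abs(np.corrcoef(P_small[:, k], y_small)[0, 1]) for k in range(P_small.shape[1])]
    best = int(np.argmax(corrs))           # best-matched source phenotype
    calib = LinearRegression().fit(P_small[:, [best]], y_small)  # recalibrate on the small study
    return calib.predict(P_test[:, [best]])
```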


Subjects
Brain; Connectome; Brain/diagnostic imaging; Connectome/methods; Humans; Magnetic Resonance Imaging/methods; Phenotype
12.
IEEE Trans Pattern Anal Mach Intell; 44(11): 8602-8617, 2022 Nov.
Article in English | MEDLINE | ID: mdl-34383644

ABSTRACT

Unsupervised domain adaptation (UDA) aims to transfer knowledge from a related but different, well-labeled source domain to a new unlabeled target domain. Most existing UDA methods require access to the source data and thus are not applicable when the data are confidential and cannot be shared due to privacy concerns. This paper tackles a realistic setting in which only a classification model trained over the source data is available, rather than the source data itself. To effectively utilize the source model for adaptation, we propose a novel approach called Source HypOthesis Transfer (SHOT), which learns the feature extraction module for the target domain by fitting the target data features to the frozen source classification module (representing the classification hypothesis). Specifically, SHOT exploits both information maximization and self-supervised learning for feature extraction module learning, so that the target features are implicitly aligned with the features of unseen source data via the same hypothesis. Furthermore, we propose a new labeling transfer strategy, which separates the target data into two splits based on the confidence of predictions (labeling information) and then employs semi-supervised learning to improve the accuracy of less-confident predictions in the target domain. We denote labeling transfer as SHOT++ if the predictions are obtained by SHOT. Extensive experiments on both digit classification and object recognition tasks show that SHOT and SHOT++ achieve results surpassing or comparable to the state of the art, demonstrating the effectiveness of our approaches for various visual domain adaptation problems. Code will be available at https://github.com/tim-learn/SHOT-plus.
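The information-maximization part of the objective can be sketched as below: each target prediction is pushed to be confident (low per-sample entropy) while the batch-averaged prediction is pushed to be diverse (high entropy). The 1:1 weighting of the two terms is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def information_maximization_loss(logits: torch.Tensor) -> torch.Tensor:
    """Low per-sample entropy (confidence) plus low negative entropy of the mean prediction (diversity)."""
    probs = F.softmax(logits, dim=1)
    ent = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()   # per-sample entropy
    mean_probs = probs.mean(dim=0)
    div = (mean_probs * torch.log(mean_probs + 1e-8)).sum()      # negative entropy of the mean
    return ent + div
```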


Subjects
Algorithms; Neural Networks, Computer
13.
IEEE Trans Pattern Anal Mach Intell; 44(1): 474-487, 2022 Jan.
Article in English | MEDLINE | ID: mdl-32750831

ABSTRACT

Despite the remarkable progress in face recognition-related technologies, reliably recognizing faces across ages remains a major challenge. The appearance of a human face changes substantially over time, resulting in significant intra-class variations. As opposed to current techniques for age-invariant face recognition, which either directly extract age-invariant features for recognition or first synthesize a face that matches the target age before feature extraction, we argue that it is more desirable to perform both tasks jointly so that they can leverage each other. To this end, we propose a deep Age-Invariant Model (AIM) for face recognition in the wild with three distinct novelties. First, AIM presents a novel unified deep architecture that jointly performs cross-age face synthesis and recognition in a mutually boosting way. Second, AIM achieves continuous face rejuvenation/aging with remarkable photorealistic and identity-preserving properties, avoiding the requirement of paired data and the true age of testing samples. Third, effective and novel training strategies are developed for end-to-end learning of the whole deep architecture, which generates powerful age-invariant face representations explicitly disentangled from the age variation. Moreover, we construct a new large-scale Cross-Age Face Recognition (CAFR) benchmark dataset to facilitate existing efforts and push the frontiers of age-invariant face recognition research. Extensive experiments on both our CAFR dataset and several other cross-age datasets (MORPH, CACD, and FG-NET) demonstrate the superiority of the proposed AIM model over the state-of-the-art methods. Benchmarking our model on the popular unconstrained face recognition datasets YTF and IJB-C further verifies its promising generalization ability in recognizing faces in the wild.


Subjects
Facial Recognition; Aging; Algorithms; Face; Humans; Learning
14.
IEEE Trans Pattern Anal Mach Intell; 44(11): 8569-8586, 2022 Nov.
Article in English | MEDLINE | ID: mdl-34029186

ABSTRACT

Existing video rain removal methods mainly focus on rain streak removal and are trained solely on synthetic data, neglecting more complex degradation factors, e.g., rain accumulation, as well as the prior knowledge in real rain data. In this paper, we therefore build a more comprehensive rain model with several degradation factors and construct a novel two-stage video rain removal method that combines the power of synthetic videos and real data. Specifically, a novel two-stage progressive network is proposed: recovery guided by a physics model, followed by further restoration via adversarial learning. The first stage performs an inverse recovery process guided by our proposed rain model, producing an initial estimate of the background frame from the input rain frame. The second stage employs adversarial learning to refine the result, i.e., recovering the overall color and illumination distributions of the frame and the background details that the first stage fails to recover, and removing the artifacts generated in the first stage. Furthermore, our rain model includes degradation factors, e.g., occlusion and rain accumulation, that appear in real scenes yet are ignored by existing methods. By generating more realistic rain images, this model enables better training and evaluation of our method. Extensive evaluations on synthetic and real videos show the effectiveness of our method in comparison to state-of-the-art methods. Our datasets, results, and code are available at: https://github.com/flyywh/Recurrent-Multi-Frame-Deraining.

15.
IEEE Trans Pattern Anal Mach Intell; 44(11): 8538-8551, 2022 Nov.
Article in English | MEDLINE | ID: mdl-34033534

ABSTRACT

In this paper, we address the makeup transfer and removal tasks simultaneously, which aim to transfer the makeup from a reference image to a source image and to remove the makeup from a with-makeup image, respectively. Existing methods have achieved much progress in constrained scenarios, but it is still very challenging for them to transfer makeup between images with large pose and expression differences, or to handle makeup details such as blush on the cheeks or highlight on the nose. In addition, they can hardly control the degree of makeup during transfer or transfer a specified part of the input face. These limitations restrict the application of previous makeup transfer methods to real-world scenarios. In this work, we propose a Pose and expression robust Spatial-aware GAN (abbreviated as PSGAN++), which is capable of performing both detail-preserving makeup transfer and effective makeup removal. For makeup transfer, PSGAN++ uses a Makeup Distill Network (MDNet) to extract makeup information, which is embedded into spatial-aware makeup matrices. We also devise an Attentive Makeup Morphing (AMM) module that specifies how the makeup in the source image is morphed from the reference image, and a makeup detail loss to supervise the model within the selected makeup detail area. For makeup removal, PSGAN++ applies an Identity Distill Network (IDNet) to embed the identity information from with-makeup images into identity matrices. Finally, the obtained makeup/identity matrices are fed to a Style Transfer Network (STNet) that is able to edit the feature maps to achieve makeup transfer or removal. To evaluate the effectiveness of our PSGAN++, we collect a Makeup Transfer In the Wild (MT-Wild) dataset that contains images with diverse poses and expressions and a Makeup Transfer High-Resolution (MT-HR) dataset that contains high-resolution images. Experiments demonstrate that PSGAN++ not only achieves state-of-the-art results with fine makeup details, even in cases of large pose/expression differences, but can also perform partial or degree-controllable makeup transfer. Both the code and the newly collected datasets will be released at https://github.com/wtjiang98/PSGAN.


Subjects
Algorithms
16.
Article in English | MEDLINE | ID: mdl-34936556

ABSTRACT

The recent state-of-the-art one-stage instance segmentation model SOLO divides the input image into a grid and directly predicts per-grid-cell object masks with fully convolutional networks, yielding performance comparable to the traditional two-stage Mask R-CNN while enjoying a much simpler architecture and higher efficiency. We observe that SOLO generates similar masks for an object at nearby grid cells, and that these neighboring predictions can complement each other, as some may better segment certain object parts; most of them, however, are directly discarded by non-maximum suppression. Motivated by this observation, we develop a novel learning-based aggregation method that improves upon SOLO by leveraging the rich neighboring information while maintaining architectural efficiency. The resulting model is named SODAR. Unlike the original per-grid-cell object masks, SODAR is implicitly supervised to learn mask representations that encode the geometric structure of nearby objects and complement adjacent representations with context. The aggregation method further includes two novel designs: 1) a mask interpolation mechanism that enables the model to generate far fewer mask representations by sharing neighboring representations among nearby grid cells, thus saving computation and memory; and 2) a deformable neighbour sampling mechanism that allows the model to adaptively adjust neighbor sampling locations, thus gathering mask representations with more relevant context and achieving higher performance. SODAR significantly improves instance segmentation performance, e.g., it outperforms a SOLO model with a ResNet-101 backbone by 2.2 AP on the COCO test set, with only about 3% additional computation. We further show consistent performance gains with the SOLOv2 model.

17.
IEEE Trans Image Process; 30: 7499-7510, 2021.
Article in English | MEDLINE | ID: mdl-34460375

ABSTRACT

Garment transfer aims to transfer a desired garment from a model image to a target person, and has attracted a great deal of attention due to its wide potential applications. However, since the model and target persons are often given with different views, body shapes, and poses, realistic garment transfer faces the following challenges, which have not been well addressed: 1) deforming the garment; 2) inferring unobserved appearance; 3) preserving fine texture details. To tackle these challenges, we propose a novel SPatial-Aware Texture Transformer (SPATT) model. Different from existing models, SPATT establishes correspondence and infers unobserved clothing appearance by leveraging the spatial prior information of a UV space. Specifically, the source image is transformed into a partial UV texture map guided by the extracted dense pose. To better infer the unseen appearance from the seen regions, we first propose a novel coordinate-prior map that defines the spatial relationship between the coordinates in the UV texture map, and design an algorithm to compute it. Based on the proposed coordinate-prior map, we present a novel spatial-aware texture generation network to complete the partial UV texture. In the second stage, we transform the completed UV texture to fit the target person. To polish the details and improve realism, we introduce a refinement generative network conditioned on the warped image and the source input. As shown experimentally, compared with existing frameworks, the proposed framework generates more realistic images with better-preserved texture details. Furthermore, difficult cases in which two persons have large pose and view differences are also handled well by SPATT.

18.
IEEE Trans Image Process; 30: 6096-6106, 2021.
Article in English | MEDLINE | ID: mdl-34185641

ABSTRACT

Human motion prediction, which aims to predict future human poses given past poses, has recently seen increased interest. Many recent approaches are based on Recurrent Neural Networks (RNNs), which model human poses with exponential maps. These approaches neglect the pose velocity as well as the temporal relations among different poses, and tend to converge to the mean pose or fail to generate natural-looking poses. We therefore propose a novel Position-Velocity Recurrent Encoder-Decoder (PVRED) for human motion prediction, which makes full use of pose velocities and temporal positional information. A temporal position embedding method is presented and a Position-Velocity RNN (PVRNN) is proposed. We also emphasize the benefits of quaternion parameterization of poses and design a novel trainable Quaternion Transformation (QT) layer, which is combined with a robust loss function during training. We provide quantitative results for both short-term prediction within the next 0.5 seconds and long-term prediction within the next 0.5 to 1 second. Experiments on several benchmarks show that our approach considerably outperforms the state-of-the-art methods. In addition, qualitative visualizations over the next 4 seconds show that our approach can predict human-like and meaningful poses over very long time horizons. Code is publicly available on GitHub: https://github.com/hongsong-wang/PVRNN.
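To illustrate the position-velocity idea in code, the sketch below feeds each pose together with its frame-difference velocity into a GRU and predicts future velocities that are integrated into poses. The GRU cell, hidden size, and residual integration are illustrative choices; the paper's PVRNN, temporal position embedding, and quaternion transformation layer are not reproduced.

```python
import torch
import torch.nn as nn

class PositionVelocityRNN(nn.Module):
    """Encode past (pose, velocity) pairs, then roll out future poses by integrating predicted velocities."""

    def __init__(self, pose_dim: int, hidden: int = 256):
        super().__init__()
        self.cell = nn.GRUCell(2 * pose_dim, hidden)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, poses: torch.Tensor, horizon: int) -> torch.Tensor:
        # poses: (B, T, D) observed history
        h = poses.new_zeros(poses.size(0), self.cell.hidden_size)
        vel = poses[:, 1:] - poses[:, :-1]                     # frame-difference velocities
        for t in range(vel.size(1)):                           # encode the history
            h = self.cell(torch.cat([poses[:, t + 1], vel[:, t]], dim=1), h)
        pose, v, preds = poses[:, -1], vel[:, -1], []
        for _ in range(horizon):                               # decode future poses
            h = self.cell(torch.cat([pose, v], dim=1), h)
            v = self.out(h)                                    # predicted velocity
            pose = pose + v                                    # integrate into the next pose
            preds.append(pose)
        return torch.stack(preds, dim=1)                       # (B, horizon, D)
```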


Subjects
Image Processing, Computer-Assisted/methods; Movement/physiology; Neural Networks, Computer; Algorithms; Humans; Video Recording
19.
IEEE Trans Image Process; 30: 5835-5847, 2021.
Article in English | MEDLINE | ID: mdl-34138709

ABSTRACT

The Coarse-To-Fine (CTF) matching scheme has been widely applied to reduce computational complexity and matching ambiguity in stereo matching and optical flow tasks by converting image pairs into multi-scale representations and performing matching from coarse to fine levels. Despite its efficiency, it suffers from several weaknesses, such as tending to blur edges and miss small structures like thin bars and holes. We find that the pixels of small structures and edges are often assigned wrong disparity/flow values in the upsampling process of the CTF framework, introducing errors at the fine levels and leading to these weaknesses. We observe that these wrong disparity/flow values can be avoided if we select the best-matched value among their neighborhood, which inspires us to propose a novel differentiable Neighbor-Search Upsampling (NSU) module. The NSU module first estimates the matching scores and then selects the best-matched disparity/flow for each pixel from its neighbors. It effectively preserves finer structure details by exploiting information from the finer level while upsampling the disparity/flow. The proposed module can be a drop-in replacement for the naive upsampling in the CTF matching framework and allows the neural networks to be trained end-to-end. By integrating the proposed NSU module into a baseline CTF matching network, we design our Detail Preserving Coarse-To-Fine (DPCTF) matching network. Comprehensive experiments demonstrate that DPCTF boosts performance on both stereo matching and optical flow tasks. Notably, our DPCTF achieves new state-of-the-art performance for both tasks: it outperforms the competitive baseline (Bi3D) by 28.8% (from 0.73 to 0.52) on EPE of the FlyingThings3D stereo dataset, and ranks first on the KITTI flow 2012 benchmark. The code is available at https://github.com/Deng-Y/DPCTF.
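A rough, differentiable sketch of the neighbor-search upsampling idea: after naive upsampling, each pixel softly selects among the disparity values of its k x k neighbors according to matching scores. The `score_fn` callback (producing a per-pixel score for every candidate) is an assumed interface; the actual NSU module derives these scores from finer-level features.

```python
import torch
import torch.nn.functional as F

def neighbor_search_upsample(disp_coarse, score_fn, scale=2, k=3, tau=1.0):
    """Softly pick the best-matched disparity among each pixel's k x k neighbors after upsampling."""
    disp_up = F.interpolate(disp_coarse, scale_factor=scale, mode="nearest") * scale
    B, _, H, W = disp_up.shape
    cand = F.unfold(disp_up, k, padding=k // 2).view(B, k * k, H, W)  # neighbor candidates per pixel
    scores = score_fn(cand)                                           # (B, k*k, H, W) matching scores
    weights = torch.softmax(scores / tau, dim=1)                      # differentiable selection
    return (weights * cand).sum(dim=1, keepdim=True)                  # (B, 1, H, W) refined disparity
```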

20.
IEEE Trans Image Process; 30: 4587-4598, 2021.
Article in English | MEDLINE | ID: mdl-33872147

ABSTRACT

Feature pyramid network (FPN) based models, which fuse semantics and salient details in a progressive manner, have proven highly effective in salient object detection. However, these models often generate saliency maps with incomplete object structures or unclear object boundaries, due to the indirect information propagation among distant layers, which makes such a fusion structure less effective. In this work, we propose a novel Cross-layer Feature Pyramid Network (CFPN), in which direct cross-layer communication is enabled to improve progressive fusion in salient object detection. Specifically, the proposed network first aggregates multi-scale features from different layers into feature maps that have access to both high- and low-level information. It then distributes the aggregated features to all the involved layers so that they gain access to richer context. In this way, the distributed features of each layer contain both semantics and salient details from all other layers simultaneously, and suffer less loss of important information during progressive feature fusion. Finally, CFPN fuses the distributed features of each layer stage by stage. This way, the high-level features that contain context useful for locating complete objects are preserved until the final output layer, and the low-level features that contain fine spatial structure are embedded into each layer. Extensive experimental results on six widely used salient object detection benchmarks and with three popular backbones clearly demonstrate that CFPN can accurately locate fairly complete salient regions and effectively segment object boundaries.
