Results 1 - 3 of 3

1.
IEEE Trans Image Process ; 32: 6543-6557, 2023.
Article in English | MEDLINE | ID: mdl-37922168

ABSTRACT

Self-supervised space-time correspondence learning from unlabeled videos holds great potential in computer vision. Most existing methods rely either on contrastive learning with negative-sample mining or on reconstruction objectives adapted from the image domain, which require dense affinity across multiple frames or optical-flow constraints. Moreover, video correspondence prediction models need to uncover more of the video's inherent properties, such as structural information. In this work, we propose HiGraph+, a space-time correspondence framework based on learnable graph kernels. Treating a video as a spatio-temporal graph, HiGraph+ is trained in a self-supervised manner to predict an unobserved hidden graph via graph kernel methods. First, we learn the structural consistency of sub-graphs for graph-level correspondence. Furthermore, we introduce a spatio-temporal hidden-graph loss, based on contrastive learning, that encourages temporal coherence of sub-graphs across frames and spatial diversity within the same frame; this allows the model to predict long-term correspondences and drives the hidden graph to acquire distinct local structural representations. Then, we refine the representation across frames at the node level via a dense graph kernel. The structural and temporal consistency of the graph provides the self-supervision for model training. HiGraph+ achieves excellent performance and demonstrates robustness on benchmarks for object, semantic-part, keypoint, and instance label propagation. Our implementation is publicly available at https://github.com/zyqin19/HiGraph.
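The abstract does not spell out the exact kernel or loss formulation, but a minimal sketch of a cross-frame contrastive objective in the spirit of the hidden-graph loss might look as follows (PyTorch; the function name, feature shapes, and temperature are assumptions, not the authors' implementation):

    import torch
    import torch.nn.functional as F

    def hidden_graph_contrastive_loss(feats_t, feats_t1, temperature=0.07):
        # feats_t, feats_t1: (N, D) node features extracted from two frames of
        # the same video. Matching node i across frames enforces temporal
        # coherence; contrasting against the other nodes encourages spatial
        # diversity within a frame.
        z_t = F.normalize(feats_t, dim=1)
        z_t1 = F.normalize(feats_t1, dim=1)
        logits = z_t @ z_t1.t() / temperature   # (N, N) cross-frame affinity graph
        targets = torch.arange(z_t.size(0), device=z_t.device)
        return F.cross_entropy(logits, targets)

    # Random features stand in for backbone outputs.
    loss = hidden_graph_contrastive_loss(torch.randn(64, 256), torch.randn(64, 256))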

2.
IEEE Trans Image Process ; 32: 2678-2692, 2023.
Article in English | MEDLINE | ID: mdl-37155388

ABSTRACT

Learning pyramidal feature representations is important for many dense prediction tasks (e.g., object detection, semantic segmentation) that demand multi-scale visual understanding. The Feature Pyramid Network (FPN) is a well-known architecture for multi-scale feature learning; however, intrinsic weaknesses in its feature extraction and fusion impede the production of informative features. This work addresses these weaknesses through a novel tripartite feature enhanced pyramid network (TFPN) with three distinct and effective designs. First, we develop a feature reference module with lateral connections that adaptively extracts bottom-up features with richer details for pyramid construction. Second, we design a feature calibration module between adjacent layers that calibrates upsampled features so they are spatially aligned, allowing feature fusion with accurate correspondences. Third, we introduce a feature feedback module that creates a communication channel from the feature pyramid back to the bottom-up backbone and doubles the encoding capacity, enabling the architecture to generate progressively more powerful representations. TFPN is extensively evaluated on four popular dense prediction tasks, i.e., object detection, instance segmentation, panoptic segmentation, and semantic segmentation. The results demonstrate that TFPN consistently and significantly outperforms the vanilla FPN. Our code is available at https://github.com/jamesliang819.
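As a rough illustration of where such modules would slot into a pyramid, the sketch below shows vanilla-FPN-style top-down fusion in PyTorch; the class name, channel sizes, and the comment marking the calibration point are assumptions, not the TFPN code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyPyramid(nn.Module):
        def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
            super().__init__()
            self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
            self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                        for _ in in_channels)

        def forward(self, feats):
            # feats ordered fine-to-coarse, e.g. [C3, C4, C5] from the backbone.
            laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
            for i in range(len(laterals) - 1, 0, -1):
                up = F.interpolate(laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
                # A TFPN-style calibration module would spatially align `up`
                # with laterals[i - 1] here; plain addition is the FPN baseline.
                laterals[i - 1] = laterals[i - 1] + up
            return [s(p) for s, p in zip(self.smooth, laterals)]

    pyr = TinyPyramid()
    outs = pyr([torch.randn(1, 256, 64, 64),
                torch.randn(1, 512, 32, 32),
                torch.randn(1, 1024, 16, 16)])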

3.
Article in English | MEDLINE | ID: mdl-29994210

ABSTRACT

The ability to rank images by their appearance has many real-world applications, such as image retrieval and image album creation. Although deep learning methods now dominate computer vision and often deliver superior performance, they are not always the methods of choice because they lack interpretability. In this work, we investigate whether the image aesthetics inference of convolutional neural networks can be improved with hand-designed features that draw on domain expertise from various fields. We compare hand-crafted feature sets in their ability to predict fine-grained aesthetics scores on two image aesthetics datasets. We observe that even earlier published feature sets can compete with more recently published algorithms, and that combining the algorithms yields a significant improvement in predicting image aesthetics. Using a tree-based learner, we perform feature elimination to identify the best-performing features overall and across different image categories. Only roughly 15% and 8% of the features are needed to reach full performance in fine-grained aesthetic score prediction and binary classification, respectively. By combining hand-crafted features with meta-features that predict image quality from CNN features, the model outperforms a baseline VGG16 model. An even larger improvement in both aesthetics score prediction and binary classification is obtained by fusing the hand-crafted features with the penultimate-layer activations. Our experiments indicate an improvement of up to 2.2%, achieving current state-of-the-art binary classification accuracy on the AVA dataset, when the hand-designed features are fused with activations from the VGG16 and ResNet50 networks.
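A hedged sketch of the late-fusion idea described above, concatenating hand-crafted descriptors with CNN penultimate-layer activations and fitting a tree-based learner (scikit-learn; the feature dimensions, regressor choice, and random placeholder arrays are assumptions, not the paper's pipeline):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_images = 500
    handcrafted = rng.normal(size=(n_images, 40))     # e.g. colorfulness, rule-of-thirds cues
    cnn_features = rng.normal(size=(n_images, 512))   # penultimate-layer activations
    scores = rng.uniform(1, 10, size=n_images)        # fine-grained aesthetic scores

    X = np.hstack([handcrafted, cnn_features])        # simple feature-level fusion
    X_tr, X_te, y_tr, y_te = train_test_split(X, scores, test_size=0.2, random_state=0)

    model = GradientBoostingRegressor().fit(X_tr, y_tr)
    print("held-out R^2:", model.score(X_te, y_te))
    # model.feature_importances_ can then drive the kind of feature
    # elimination described in the abstract.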
