Pesquisa | BVS IEC

1.

Vanishing point attracts gaze in free-viewing and visual search tasks.

Borji, Ali; Feng, Mengyang; Lu, Huchuan.

J Vis ; 16(14): 18, 2016 11 01.

Artigo em Inglês | MEDLINE | ID: mdl-27903005

RESUMO

Several structural scene cues such as gist, layout, horizontal line, openness, and depth have been shown to guide scene perception (e.g., Oliva & Torralba, 2001); Ross & Oliva, 2009). Here, to investigate whether vanishing point (VP) plays a significant role in gaze guidance, we ran two experiments. In the first one, we recorded fixations of 10 observers (six male, four female; mean age 22; SD = 0.84) freely viewing 532 images, out of which 319 had a VP (shuffled presentation; each image for 4 s). We found that the average number of fixations at a local region (80 × 80 pixels) centered at the VP is significantly higher than the average fixations at random locations (t test; n = 319; p < 0.001). To address the confounding factor of saliency, we learned a combined model of bottom-up saliency and VP. The AUC (area under curve) score of our model (0.85; SD = 0.01) is significantly higher than the base saliency model (e.g., 0.8 using attention for information maximization (AIM) model by Bruce & Tsotsos, 2005, t test; p = 3.14e-16) and the VP-only model (0.64, t test; p < 0.001). In the second experiment, we asked 14 subjects (10 male, four female; mean age 23.07, SD = 1.26) to search for a target character (T or L) placed randomly on a 3 × 3 imaginary grid overlaid on top of an image. Subjects reported their answers by pressing one of the two keys. Stimuli consisted of 270 color images (180 with a single VP, 90 without). The target happened with equal probability inside each cell (15 times L, 15 times T). We found that subjects were significantly faster (and more accurate) when the target appeared inside the cell containing the VP compared to cells without the VP (median across 14 subjects 1.34 s vs. 1.96 s; Wilcoxon rank-sum test; p = 0.0014). These findings support the hypothesis that vanishing point, similar to face, text (Cerf, Frady, & Koch, 2009), and gaze direction Borji, Parks, & Itti, 2014) guides attention in free-viewing and visual search tasks.

Assuntos

Movimentos Oculares/fisiologia , Fixação Ocular/fisiologia , Reconhecimento Visual de Modelos/fisiologia , Percepção Visual/fisiologia , Atenção/fisiologia , Sinais (Psicologia) , Feminino , Humanos , Masculino , Probabilidade , Adulto Jovem

2.

GroupMorph: Medical Image Registration via Grouping Network with Contextual Fusion.

Tan, Zuopeng; Zhang, Lihe; Lv, Yanan; Ma, Yili; Lu, Huchuan.

IEEE Trans Med Imaging ; PP2024 May 13.

Artigo em Inglês | MEDLINE | ID: mdl-38739510

RESUMO

Pyramid-based deformation decomposition is a promising registration framework, which gradually decomposes the deformation field into multi-resolution subfields for precise registration. However, most pyramid-based methods directly produce one subfield per resolution level, which does not fully depict the spatial deformation. In this paper, we propose a novel registration model, called GroupMorph. Different from typical pyramid-based methods, we adopt the grouping-combination strategy to predict deformation field at each resolution. Specifically, we perform group-wise correlation calculation to measure the similarities of grouped features. After that, n groups of deformation subfields with different receptive fields are predicted in parallel. By composing these subfields, a deformation field with multi-receptive field ranges is formed, which can effectively identify both large and small deformations. Meanwhile, a contextual fusion module is designed to fuse the contextual features and provide the inter-group information for the field estimator of the next level. By leveraging the inter-group correspondence, the synergy among deformation subfields is enhanced. Extensive experiments on four public datasets demonstrate the effectiveness of GroupMorph. Code is available at https://github.com/TVayne/GroupMorph.

3.

Deep Boosting Learning: A Brand-New Cooperative Approach for Image-Text Matching.

Diao, Haiwen; Zhang, Ying; Gao, Shang; Ruan, Xiang; Lu, Huchuan.

IEEE Trans Image Process ; 33: 3341-3352, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38713578

RESUMO

Image-text matching remains a challenging task due to heterogeneous semantic diversity across modalities and insufficient distance separability within triplets. Different from previous approaches focusing on enhancing multi-modal representations or exploiting cross-modal correspondence for more accurate retrieval, in this paper we aim to leverage the knowledge transfer between peer branches in a boosting manner to seek a more powerful matching model. Specifically, we propose a brand-new Deep Boosting Learning (DBL) algorithm, where an anchor branch is first trained to provide insights into the data properties, with a target branch gaining more advanced knowledge to develop optimal features and distance metrics. Concretely, an anchor branch initially learns the absolute or relative distance between positive and negative pairs, providing a foundational understanding of the particular network and data distribution. Building upon this knowledge, a target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples. Extensive experiments validate that our DBL can achieve impressive and consistent improvements based on various recent state-of-the-art models in the image-text matching field, and outperform related popular cooperative strategies, e.g., Conventional Distillation, Mutual Learning, and Contrastive Learning. Beyond the above, we confirm that DBL can be seamlessly integrated into their training scenarios and achieve superior performance under the same computational costs, demonstrating the flexibility and broad applicability of our proposed method.

4.

ZoomNeXt: A Unified Collaborative Pyramid Network for Camouflaged Object Detection.

Pang, Youwei; Zhao, Xiaoqi; Xiang, Tian-Zhu; Zhang, Lihe; Lu, Huchuan.

IEEE Trans Pattern Anal Mach Intell ; PP2024 Jun 21.

Artigo em Inglês | MEDLINE | ID: mdl-38905087

RESUMO

Recent camouflaged object detection (COD) attempts to segment objects visually blended into their surroundings, which is extremely complex and difficult in real-world scenarios. Apart from the high intrinsic similarity between camouflaged objects and their background, objects are usually diverse in scale, fuzzy in appearance, and even severely occluded. To this end, we propose an effective unified collaborative pyramid network that mimics human behavior when observing vague images and videos, i.e. zooming in and out. Specifically, our approach employs the zooming strategy to learn discriminative mixed-scale semantics by the multi-head scale integration and rich granularity perception units, which are designed to fully explore imperceptible clues between candidate objects and background surroundings. The former's intrinsic multi-head aggregation provides more diverse visual patterns. The latter's routing mechanism can effectively propagate inter-frame differences in spatiotemporal scenarios and be adaptively deactivated and output all-zero results for static representations. They provide a solid foundation for realizing a unified architecture for static and dynamic COD. Moreover, considering the uncertainty and ambiguity derived from indistinguishable textures, we construct a simple yet effective regularization, uncertainty awareness loss, to encourage predictions with higher confidence in candidate regions. Our highly task-friendly framework consistently outperforms existing state-of-the-art methods in image and video COD benchmarks.

5.

Referring Image Segmentation With Fine-Grained Semantic Funneling Infusion.

Yang, Jiaxing; Zhang, Lihe; Lu, Huchuan.

IEEE Trans Neural Netw Learn Syst ; PP2023 Jun 13.

Artigo em Inglês | MEDLINE | ID: mdl-37310821

RESUMO

Recently, referring image segmentation has attracted wide attention given its huge potential in human-robot interaction. Networks to identify the referred region must have a deep understanding of both the image and language semantics. To do so, existing works tend to design various mechanisms to achieve cross-modality fusion, for example, tile and concatenation and vanilla nonlocal manipulation. However, the plain fusion usually is either coarse or constrained by the exorbitant computation overhead, finally causing not enough understanding of the referent. In this work, we propose a fine-grained semantic funneling infusion (FSFI) mechanism to solve the problem. The FSFI introduces a constant spatial constraint on the querying entities from different encoding stages and dynamically infuses the gleaned language semantic into the vision branch. Moreover, it decomposes the features from different modalities into more delicate components, allowing the fusion to happen in multiple low-dimensional spaces. The fusion is more effective than the one only happening in one high-dimensional space, given its ability to sink more representative information along the channel dimension. Another problem haunting the task is that the instilling of high-abstract semantic will blur the details of the referent. Targetedly, we propose a multiscale attention-enhanced decoder (MAED) to alleviate the problem. We design a detail enhancement operator (DeEh) and apply it in a multiscale and progressive way. Features from the higher level are used to generate attention guidance to enlighten the lower-level features to more attend to the detail regions. Extensive results on the challenging benchmarks show that our network performs favorably against the state-of-the-arts (SOTAs).

6.

Interactive Feature Embedding for Infrared and Visible Image Fusion.

Zhao, Fan; Zhao, Wenda; Lu, Huchuan.

IEEE Trans Neural Netw Learn Syst ; PP2023 Apr 11.

Artigo em Inglês | MEDLINE | ID: mdl-37040245

RESUMO

General deep learning-based methods for infrared and visible image fusion rely on the unsupervised mechanism for vital information retention by utilizing elaborately designed loss functions. However, the unsupervised mechanism depends on a well-designed loss function, which cannot guarantee that all vital information of source images is sufficiently extracted. In this work, we propose a novel interactive feature embedding in a self-supervised learning framework for infrared and visible image fusion, attempting to overcome the issue of vital information degradation. With the help of a self-supervised learning framework, hierarchical representations of source images can be efficiently extracted. In particular, interactive feature embedding models are tactfully designed to build a bridge between self-supervised learning and infrared and visible image fusion learning, achieving vital information retention. Qualitative and quantitative evaluations exhibit that the proposed method performs favorably against state-of-the-art methods.

7.

Deeply Coupled Convolution-Transformer With Spatial-Temporal Complementary Learning for Video-Based Person Re-Identification.

Liu, Xuehu; Yu, Chenyang; Zhang, Pingping; Lu, Huchuan.

IEEE Trans Neural Netw Learn Syst ; PP2023 May 26.

Artigo em Inglês | MEDLINE | ID: mdl-37235467

RESUMO

Advanced deep convolutional neural networks (CNNs) have shown great success in video-based person re-identification (Re-ID). However, they usually focus on the most obvious regions of persons with a limited global representation ability. Recently, it witnesses that Transformers explore the interpatch relationships with global observations for performance improvements. In this work, we take both the sides and propose a novel spatial-temporal complementary learning framework named deeply coupled convolution-transformer (DCCT) for high-performance video-based person Re-ID. First, we couple CNNs and Transformers to extract two kinds of visual features and experimentally verify their complementarity. Furthermore, in spatial, we propose a complementary content attention (CCA) to take advantages of the coupled structure and guide independent features for spatial complementary learning. In temporal, a hierarchical temporal aggregation (HTA) is proposed to progressively capture the interframe dependencies and encode temporal information. Besides, a gated attention (GA) is used to deliver aggregated temporal information into the CNN and Transformer branches for temporal complementary learning. Finally, we introduce a self-distillation training strategy to transfer the superior spatial-temporal knowledge to backbone networks for higher accuracy and more efficiency. In this way, two kinds of typical features from same videos are integrated mechanically for more informative representations. Extensive experiments on four public Re-ID benchmarks demonstrate that our framework could attain better performances than most state-of-the-art methods.

8.

CAVER: Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection.

Pang, Youwei; Zhao, Xiaoqi; Zhang, Lihe; Lu, Huchuan.

IEEE Trans Image Process ; PP2023 Jan 11.

Artigo em Inglês | MEDLINE | ID: mdl-37018701

RESUMO

Most of the existing bi-modal (RGB-D and RGB-T) salient object detection methods utilize the convolution operation and construct complex interweave fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation constrains the performance of the convolution-based methods to a ceiling. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed cross-modal view-mixed transformer (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats the multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. Besides, considering the quadratic complexity w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when it is equipped with the proposed components.

9.

PANet: Patch-Aware Network for Light Field Salient Object Detection.

Piao, Yongri; Jiang, Yongyao; Zhang, Miao; Wang, Jian; Lu, Huchuan.

IEEE Trans Cybern ; 53(1): 379-391, 2023 Jan.

Artigo em Inglês | MEDLINE | ID: mdl-34406954

RESUMO

Most existing light field saliency detection methods have achieved great success by exploiting unique light field data-focus information in focal slices. However, they process light field data in a slicewise way, leading to suboptimal results because the relative contribution of different regions in focal slices is ignored. How we can comprehensively explore and integrate focused saliency regions that would positively contribute to accurate saliency detection. Answering this question inspires us to develop a new insight. In this article, we propose a patch-aware network to explore light field data in a regionwise way. First, we excavate focused salient regions with a proposed multisource learning module (MSLM), which generates a filtering strategy for integration followed by three guidances based on saliency, boundary, and position. Second, we design a sharpness recognition module (SRM) to refine and update this strategy and perform feature integration. With our proposed MSLM and SRM, we can obtain more accurate and complete saliency maps. Comprehensive experiments on three benchmark datasets prove that our proposed method achieves competitive performance over 2-D, 3-D, and 4-D salient object detection methods. The code and results of our method are available at https://github.com/OIPLab-DUT/IEEE-TCYB-PANet.

10.

Bidirectional Relationship Inferring Network for Referring Image Localization and Segmentation.

Feng, Guang; Hu, Zhiwei; Zhang, Lihe; Sun, Jiayu; Lu, Huchuan.

IEEE Trans Neural Netw Learn Syst ; 34(5): 2246-2258, 2023 May.

Artigo em Inglês | MEDLINE | ID: mdl-34469313

RESUMO

Recently, referring image localization and segmentation has aroused widespread interest. However, the existing methods lack a clear description of the interdependence between language and vision. To this end, we present a bidirectional relationship inferring network (BRINet) to effectively address the challenging tasks. Specifically, we first employ a vision-guided linguistic attention module to perceive the keywords corresponding to each image region. Then, language-guided visual attention adopts the learned adaptive language to guide the update of the visual features. Together, they form a bidirectional cross-modal attention module (BCAM) to achieve the mutual guidance between language and vision. They can help the network align the cross-modal features better. Based on the vanilla language-guided visual attention, we further design an asymmetric language-guided visual attention, which significantly reduces the computational cost by modeling the relationship between each pixel and each pooled subregion. In addition, a segmentation-guided bottom-up augmentation module (SBAM) is utilized to selectively combine multilevel information flow for object localization. Experiments show that our method outperforms other state-of-the-art methods on three referring image localization datasets and four referring image segmentation datasets.

11.

Effective Local and Global Search for Fast Long-Term Tracking.

Zhao, Haojie; Yan, Bin; Wang, Dong; Qian, Xuesheng; Yang, Xiaoyun; Lu, Huchuan.

IEEE Trans Pattern Anal Mach Intell ; 45(1): 460-474, 2023 Jan.

Artigo em Inglês | MEDLINE | ID: mdl-35196229

RESUMO

Compared with short-term tracking, long-term tracking remains a challenging task that usually requires the tracking algorithm to track targets within a local region and re-detect targets over the entire image. However, few works have been done and their performances have also been limited. In this paper, we present a novel robust and real-time long-term tracking framework based on the proposed local search module and re-detection module. The local search module consists of an effective bounding box regressor to generate a series of candidate proposals and a target verifier to infer the optimal candidate with its confidence score. For local search, we design a long short-term updated scheme to improve the target verifier. The verification capability of the tracker can be improved by using several templates updated at different times. Based on the verification scores, our tracker determines whether the tracked object is present or absent and then chooses the tracking strategies of local or global search, respectively, in the next frame. For global re-detection, we develop a novel re-detection module that can estimate the target position and target size for a given base tracker. We conduct a series of experiments to demonstrate that this module can be flexibly integrated into many other tracking algorithms for long-term tracking and that it can improve long-term tracking performance effectively. Numerous experiments and discussions are conducted on several popular tracking datasets, including VOT, OxUvA, TLP, and LaSOT. The experimental results demonstrate that the proposed tracker achieves satisfactory performance with a real-time speed. Code is available at https://github.com/difhnp/ELGLT.

12.

Robust Online Tracking With Meta-Updater.

Zhao, Jie; Dai, Kenan; Zhang, Pengyu; Wang, Dong; Lu, Huchuan.

IEEE Trans Pattern Anal Mach Intell ; 45(5): 6168-6182, 2023 May.

Artigo em Inglês | MEDLINE | ID: mdl-36040937

RESUMO

In a sequence, the appearance of both the target and background often changes dramatically. Offline-trained models may not handle huge appearance variations well, causing tracking failures. Most discriminative trackers address this issue by introducing an online update scheme, making the model dynamically adapt the changes of the target and background. Although the online update scheme plays an important role in improving the tracker's accuracy, it inevitably pollutes the model with noisy observation samples. It is necessary to reduce the risk of the online update scheme for better tracking. In this work, we propose a novel offline-trained Meta-Updater to address an important but unsolved problem: Is the tracker ready for updating in the current frame? The proposed module can effectively integrate geometric, discriminative, and appearance cues in a sequential manner, and then mine the sequential information with a designed cascaded LSTM module. Moreover, we strengthen the effect of appearance information on the module, i.e., the additional local outlier factor is introduced to integrate into a newly designed network. We integrate our meta-updater into eight different types of online update trackers. Extensive experiments on four long-term and two short-term tracking benchmarks demonstrate that our meta-updater is effective and has strong generalization ability.

13.

Referring Segmentation via Encoder-Fused Cross-Modal Attention Network.

Feng, Guang; Zhang, Lihe; Sun, Jiayu; Hu, Zhiwei; Lu, Huchuan.

IEEE Trans Pattern Anal Mach Intell ; 45(6): 7654-7667, 2023 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-36367919

RESUMO

This paper focuses on referring segmentation, which aims to selectively segment the corresponding visual region in an image (or video) according to the referring expression. However, the existing methods usually consider the interaction between multi-modal features at the decoding end of the network. Specifically, they interact the visual features of each scale with language respectively, thus ignoring the correlation between multi-scale features. In this work, we present an encoder fusion network (EFN), which transfers the multi-modal feature learning process from the decoding end to the encoding end and realizes the gradual refinement of multi-modal features by the language. In EFN, we also adopt a co-attention mechanism to promote the mutual alignment of language and visual information in feature space. In the decoding stage, a boundary enhancement module (BEM) is proposed to enhance the network's attention to the details of the target. For video data, we introduce an asymmetric cross-frame attention module (ACFM) to effectively capture the temporal information from the video frames by computing the relationship between each pixel of the current frame and each pooled sub-region of the reference frames. Extensive experiments on referring image/video segmentation datasets show that our method outperforms the state-of-the-art performance.

14.

Nowhere to Disguise: Spot Camouflaged Objects via Saliency Attribute Transfer.

Zhao, Wenda; Xie, Shigeng; Zhao, Fan; He, You; Lu, Huchuan.

IEEE Trans Image Process ; 32: 3108-3120, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37220043

RESUMO

Both salient object detection (SOD) and camouflaged object detection (COD) are typical object segmentation tasks. They are intuitively contradictory, but are intrinsically related. In this paper, we explore the relationship between SOD and COD, and then borrow successful SOD models to detect camouflaged objects to save the design cost of COD models. The core insight is that both SOD and COD leverage two aspects of information: object semantic representations for distinguishing object and background, and context attributes that decide object category. Specifically, we start by decoupling context attributes and object semantic representations from both SOD and COD datasets through designing a novel decoupling framework with triple measure constraints. Then, we transfer saliency context attributes to the camouflaged images through introducing an attribute transfer network. The generated weakly camouflaged images can bridge the context attribute gap between SOD and COD, thereby improving the SOD models' performances on COD datasets. Comprehensive experiments on three widely-used COD datasets verify the ability of the proposed method. Code and model are available at: https://github.com/wdzhao123/SAT.

15.

High-Performance Transformer Tracking.

Chen, Xin; Yan, Bin; Zhu, Jiawen; Lu, Huchuan; Ruan, Xiang; Wang, Dong.

IEEE Trans Pattern Anal Mach Intell ; 45(7): 8507-8523, 2023 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-37015509

RESUMO

Correlation has a critical role in the tracking field, especially in recent popular Siamese-based trackers. The correlation operation is a simple fusion method that considers the similarity between the template and the search region. However, the correlation operation is a local linear matching process, losing semantic information and easily falling into a local optimum, which may be the bottleneck in designing high-accuracy tracking algorithms. In this work, to determine whether a better feature fusion method exists than correlation, a novel attention-based feature fusion network, inspired by the transformer, is presented. This network effectively combines the template and search region features using attention mechanism. Specifically, the proposed method includes an ego-context augment module based on self-attention and a cross-feature augment module based on cross-attention. First, we present a transformer tracking (named TransT) method based on the Siamese-like feature extraction backbone, the designed attention-based fusion mechanism, and the classification and regression heads. Based on the TransT baseline, we also design a segmentation branch to generate the accurate mask. Finally, we propose a stronger version of TransT by extending it with a multi-template scheme and an IoU prediction head, named TransT-M. Experiments show that our TransT and TransT-M methods achieve promising results on seven popular benchmarks. Code and models are available at https://github.com/chenxin-dlut/TransT-M.

16.

Depth Injection Framework for RGBD Salient Object Detection.

Yao, Shunyu; Zhang, Miao; Piao, Yongri; Qiu, Chaoyi; Lu, Huchuan.

IEEE Trans Image Process ; 32: 5340-5352, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37729570

RESUMO

Depth data with a predominance of discriminative power in location is advantageous for accurate salient object detection (SOD). Existing RGBD SOD methods have focused on how to properly use depth information for complementary fusion with RGB data, having achieved great success. In this work, we attempt a far more ambitious use of the depth information by injecting the depth maps into the encoder in a single-stream model. Specifically, we propose a depth injection framework (DIF) equipped with an Injection Scheme (IS) and a Depth Injection Module (DIM). The proposed IS enhances the semantic representation of the RGB features in the encoder by directly injecting depth maps into the high-level encoder blocks, while helping our model maintain computational convenience. Our proposed DIM acts as a bridge between the depth maps and the hierarchical RGB features of the encoder and helps the information of two modalities complement and guide each other, contributing to a great fusion effect. Experimental results demonstrate that our proposed method can achieve state-of-the-art performance on six RGBD datasets. Moreover, our method can achieve excellent performance on RGBT SOD and our DIM can be easily applied to single-stream SOD models and the transformer architecture, proving a powerful generalization ability.

17.

Real-Time Semantic Segmentation via a Densely Aggregated Bilateral Network.

Yang, Shu; Zhang, Lu; Liu, Shuai; Lu, Huchuan; Chen, Hao.

IEEE Trans Neural Netw Learn Syst ; PP2023 Nov 01.

Artigo em Inglês | MEDLINE | ID: mdl-37910414

RESUMO

With the growing demands of applications on online devices, the speed-accuracy trade-off is critical in the semantic segmentation system. Recently, the bilateral segmentation network has shown promising capacity to achieve the balance between favorable accuracy and fast speed, and has become the mainstream backbone in real-time semantic segmentation. Segmentation of target objects relies on high-level semantics, whereas it requires detailed low-level features to model specific local patterns for accurate location. However, the lightweight backbone of bilateral architecture limits the extraction of semantic context and spatial details. And the late fusion of the bilateral streams incurs the insufficient aggregation of semantic context and spatial details. In this article, we propose a densely aggregated bilateral network (DAB-Net) for real-time semantic segmentation. In the context path, a patchwise context enhancement (PCE) module is proposed to efficiently capture the local semantic contextual information from spatialwise and channelwise, respectively. Meanwhile, a context-guided spatial path (CGSP) is designed to exploit more spatial information by encoding finer details from the raw image and the transition from the context path. Finally, with multiple interactions between bilateral branches, the intertwined outputs from bilateral streams are combined in a unified decoder for a final interaction to further enhance the feature representation, which generates the final segmentation prediction. Experimental results on three public benchmarks demonstrate that our proposed method achieves higher accuracy with a limited decay in speed, which performs favorably against state-of-the-art real-time approaches and runs at 31.1 frames/s (FPS) on the high resolution of [Formula: see text] . The source code is released at https://github.com/isyangshu/DABNet.

18.

Plug-and-Play Regulators for Image-Text Matching.

Diao, Haiwen; Zhang, Ying; Liu, Wei; Ruan, Xiang; Lu, Huchuan.

IEEE Trans Image Process ; 32: 2322-2334, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37071519

RESUMO

Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching. Generally, recent approaches first employ a cross-modal attention unit to capture latent region-word interactions, and then integrate all the alignments to obtain the final similarity. However, most of them adopt one-time forward association or aggregation strategies with complex architectures or additional information, while ignoring the regulation ability of network feedback. In this paper, we develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations. Specifically, we propose (i) a Recurrent Correspondence Regulator (RCR) which facilitates the cross-modal attention unit progressively with adaptive attention factors to capture more flexible correspondence, and (ii) a Recurrent Aggregation Regulator (RAR) which adjusts the aggregation weights repeatedly to increasingly emphasize important alignments and dilute unimportant ones. Besides, it is interesting that RCR and RAR are "plug-and-play": both of them can be incorporated into many frameworks based on cross-modal interaction to obtain significant benefits, and their cooperation achieves further improvements. Extensive experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models, confirming the general effectiveness and generalization ability of the proposed methods.

19.

Deformable Dynamic Sampling and Dynamic Predictable Mask Mining for Image Inpainting.

Shang, Cai; Zeng, Yu; Yang, Shu; Jia, Xu; Lu, Huchuan; He, You.

IEEE Trans Neural Netw Learn Syst ; PP2023 Oct 26.

Artigo em Inglês | MEDLINE | ID: mdl-37883252

RESUMO

Existing image inpainting methods often produce artifacts that are caused by using vanilla convolution layers as building blocks that treat all image regions equally and generate holes at random locations with equal probability. This design does not differentiate the missing regions and valid regions in inference and does not consider the predictability of missing regions in training. To address these issues, we propose a deformable dynamic sampling (DDS) mechanism which is built on deformable convolutions (DCs), and a constraint is proposed to avoid the deformably sampled elements falling into the corrupted regions. Furthermore, to select both valid sample locations and suitable kernels dynamically, we equip DCs with content-aware dynamic kernel selection (DKS). In addition, to further encourage the DDS mechanism to find meaningful sampling locations, we propose to train the inpainting model with mined predictable regions as holes. During training, we jointly train a mask generator with the inpainting network to generate hole masks dynamically for each training sample. Thus, the mask generator can find large yet predictable missing regions as a better alternative to random masks. Extensive experiments demonstrate the advantages of our method over state-of-the-art methods qualitatively and quantitatively.

20.

Self-Supervised Tracking via Target-Aware Data Synthesis.

Li, Xin; Pei, Wenjie; Wang, Yaowei; He, Zhenyu; Lu, Huchuan; Yang, Ming-Hsuan.

IEEE Trans Neural Netw Learn Syst ; PP2023 Jan 02.

Artigo em Inglês | MEDLINE | ID: mdl-37018296

RESUMO

While deep-learning-based tracking methods have achieved substantial progress, they entail large-scale and high-quality annotated data for sufficient training. To eliminate expensive and exhaustive annotation, we study self-supervised (SS) learning for visual tracking. In this work, we develop the crop-transform-paste operation, which is able to synthesize sufficient training data by simulating various appearance variations during tracking, including appearance variations of objects and background interference. Since the target state is known in all synthesized data, existing deep trackers can be trained in routine ways using the synthesized data without human annotation. The proposed target-aware data-synthesis method adapts existing tracking approaches within a SS learning framework without algorithmic changes. Thus, the proposed SS learning mechanism can be seamlessly integrated into existing tracking frameworks to perform training. Extensive experiments show that our method: 1) achieves favorable performance against supervised (Su) learning schemes under the cases with limited annotations; 2) helps deal with various tracking challenges such as object deformation, occlusion (OCC), or background clutter (BC) due to its manipulability; 3) performs favorably against the state-of-the-art unsupervised tracking methods; and 4) boosts the performance of various state-of-the-art Su learning frameworks, including SiamRPN++, DiMP, and TransT.

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA