Results 1 - 4 of 4

1.
Article in English | MEDLINE | ID: mdl-38980783

ABSTRACT

Masked autoencoder (MAE) has been regarded as a capable self-supervised learner for various downstream tasks. Nevertheless, the model still lacks high-level discriminability, which results in poor linear probing performance. Given that strong augmentation plays an essential role in contrastive learning, can we capitalize on strong augmentation in MAE? The difficulty originates from the pixel uncertainty caused by strong augmentation, which may affect the reconstruction; thus, directly introducing strong augmentation into MAE often hurts performance. In this article, we delve into the potential of strongly augmented views to enhance MAE while maintaining MAE's advantages. To this end, we propose a simple yet effective masked Siamese autoencoder (MSA) model, which consists of a student branch and a teacher branch. The student branch adopts MAE's architecture, and the teacher branch treats the unmasked strong view as an exemplary teacher to impose high-level discrimination onto the student branch. We demonstrate that our MSA improves the model's spatial perception capability and therefore globally favors inter-image discrimination. Empirical evidence shows that the model pretrained by MSA provides superior performance across different downstream tasks. Notably, linear probing on frozen features extracted from MSA yields a 6.1% gain over MAE on ImageNet-1k. Fine-tuning (FT) the network on the VQAv2 task achieves 67.4% accuracy, outperforming the supervised method DeiT by 1.6% and MAE by 1.2%. Codes and models are available at https://github.com/KimSoybean/MSA.
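
As an illustration of the training objective described above, the sketch below combines an MAE-style reconstruction loss on the masked, weakly augmented view with a feature-alignment term toward the unmasked strong view processed by the teacher branch. This is a minimal sketch under stated assumptions: the module interfaces (a student whose forward returns per-patch predictions, targets, and the patch mask; an assumed student.encode_global; a teacher encoder), the cosine-similarity distillation term, and the loss weight lam are illustrative, not the authors' implementation.

```python
# Minimal sketch of one MSA-style training step (PyTorch). All module
# interfaces and the exact distillation loss are illustrative assumptions.
import torch
import torch.nn.functional as F

def msa_step(student, teacher, weak_view, strong_view, mask_ratio=0.75, lam=1.0):
    # Student branch: MAE-style masked reconstruction on the weakly augmented view.
    # Assumed interface: returns per-patch predictions, targets, and the binary mask.
    pred, target, mask = student(weak_view, mask_ratio=mask_ratio)
    recon_loss = ((pred - target) ** 2).mean(dim=-1)
    recon_loss = (recon_loss * mask).sum() / mask.sum()

    # Teacher branch: encodes the *unmasked* strong view; no gradient, so it
    # acts purely as an exemplar that imposes high-level discrimination.
    with torch.no_grad():
        t_feat = teacher(strong_view)              # (B, D) global feature

    s_feat = student.encode_global(weak_view)      # (B, D) student global feature
    distill_loss = 1 - F.cosine_similarity(s_feat, t_feat, dim=-1).mean()

    return recon_loss + lam * distill_loss
```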

2.
Article in English | MEDLINE | ID: mdl-37934642

ABSTRACT

This article presents a self-corrective network-based long-term tracker (SCLT) comprising a self-modulated tracking reliability evaluator (STRE) and a self-adjusting proposal postprocessor (SPPP). Targets in long-term sequences often suffer from severe appearance variations. Existing long-term trackers typically update their models online to adapt to these variations, but inaccurate tracking results introduce cumulative error into the updated model, which may cause severe drift. A robust long-term tracker should therefore have a self-corrective capability: it should judge whether a tracking result is reliable and then recapture the target when severe drift occurs due to serious challenges (e.g., full occlusion and out-of-view). To address the first issue, the STRE provides an effective tracking reliability classifier built on a modulation subnetwork. The classifier is trained on samples with pseudo labels generated by an adaptive self-labeling strategy. The adaptive self-labeling automatically labels the hard negative samples that are often neglected by existing trackers, according to the statistical characteristics of the target state, and the network modulation mechanism guides the backbone network to learn more discriminative features without extra training data. To address the second issue, once the STRE has been triggered, the SPPP applies a dynamic NMS to recapture the target promptly and accurately. In addition, the STRE and the SPPP show good transferability, improving performance when combined with multiple baselines. Compared with the commonly used greedy NMS, the proposed dynamic NMS leverages an adaptive strategy to effectively handle the different conditions of in-view and out-of-view targets, thereby selecting the most probable object box, which is essential for accurately updating the base tracker online. Extensive evaluations on four large-scale and challenging benchmark datasets, including VOT2021LT, OxUvALT, TLP, and LaSOT, demonstrate the superiority of the proposed SCLT over a variety of state-of-the-art long-term trackers on all measures. Source codes and demos can be found at https://github.com/TJUT-CV/SCLT.
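
To make the contrast with greedy NMS concrete, the snippet below sketches a greedy NMS whose overlap threshold changes with an in-view/out-of-view flag, which is one simple way to realize a state-adaptive suppression rule. The threshold values and the binary in_view signal are illustrative assumptions; the paper's dynamic NMS may adapt differently.

```python
# Greedy NMS with a state-dependent overlap threshold: a minimal sketch of
# adapting suppression to in-view vs. out-of-view conditions (illustrative only).
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def adaptive_nms(boxes, scores, in_view, thr_in=0.3, thr_out=0.6):
    """Keep high-scoring proposals; suppress more aggressively when the target is in view."""
    thr = thr_in if in_view else thr_out
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thr]
    return keep
```

In a pipeline like the one described, such a routine would run over re-detection proposals only after the reliability classifier flags the current result as unreliable.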

3.
Article in English | MEDLINE | ID: mdl-37141054

ABSTRACT

Cognitive research has found that humans accomplish event segmentation as a side effect of event anticipation. Inspired by this discovery, we propose a simple yet effective end-to-end self-supervised learning framework for event segmentation/boundary detection. Unlike mainstream clustering-based methods, our framework exploits a transformer-based feature reconstruction scheme to detect event boundaries via reconstruction errors. This is consistent with the fact that humans spot new events by leveraging the deviation between their prediction and what is perceived. Owing to their semantic heterogeneity, frames at boundaries are difficult to reconstruct (generally yielding large reconstruction errors), which is favorable for event boundary detection. In addition, since the reconstruction occurs at the semantic feature level instead of the pixel level, we develop a temporal contrastive feature embedding (TCFE) module to learn the semantic visual representation for frame feature reconstruction (FFR). This procedure is akin to humans building up experiences through "long-term memory." The goal of our work is to segment generic events rather than localize specific ones, with a focus on achieving accurate event boundaries. Accordingly, we adopt the F1 score (Precision/Recall) as our primary evaluation metric for a fair comparison with previous approaches, and we also report the conventional frame-based mean over frames (MoF) and intersection over union (IoU) metrics. We thoroughly benchmark our work on four publicly available datasets and demonstrate substantially better results. The source code is available at https://github.com/wang3702/CoSeg.
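
As a concrete reading of "detect event boundaries by reconstruction errors", the sketch below marks boundary frames as local peaks of the per-frame reconstruction error that rise well above the sequence average. The mean-plus-standard-deviation threshold and the minimum gap between boundaries are illustrative assumptions, not the paper's exact post-processing.

```python
# Minimal sketch: pick event boundaries as prominent peaks of the per-frame
# reconstruction error curve. The thresholding rule is an illustrative assumption.
import numpy as np

def detect_boundaries(recon_err, rel_thresh=1.0, min_gap=5):
    err = np.asarray(recon_err, dtype=float)
    thresh = err.mean() + rel_thresh * err.std()      # "well above average" cutoff
    boundaries = []
    for t in range(1, len(err) - 1):
        is_peak = err[t] >= err[t - 1] and err[t] >= err[t + 1]
        far_enough = not boundaries or t - boundaries[-1] >= min_gap
        if is_peak and err[t] > thresh and far_enough:
            boundaries.append(t)
    return boundaries
```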

4.
IEEE Trans Med Imaging; 39(9): 2904-2919, 2020 Sep.
Article in English | MEDLINE | ID: mdl-32167888

ABSTRACT

Vascular tree disentanglement and vessel type classification are two crucial steps of graph-based methods for retinal artery-vein (A/V) separation. Existing approaches treat them as two independent tasks and mostly rely on ad hoc rules (e.g., changes of vessel direction) and hand-crafted features (e.g., color, thickness) to handle them, respectively. However, we argue that the two tasks are highly correlated and should be handled jointly, since knowing the A/V type can unravel highly entangled vascular trees, which in turn helps infer the types of connected vessels that are hard to classify from appearance alone. Therefore, designing features and models in isolation for the two tasks often leads to a suboptimal solution for A/V separation. In view of this, this paper proposes a multi-task siamese network that learns the two tasks jointly and thus yields more robust deep features for accurate A/V separation. Specifically, we first introduce Convolution Along Vessel (CAV) to extract visual features by convolving a fundus image along vessel segments, and geometric features by tracking the directions of blood flow in vessels. The siamese network is then trained on two tasks: i) classifying the A/V types of vessel segments using visual features only, and ii) estimating the similarity of every pair of connected segments by comparing their visual and geometric features, in order to disentangle the vasculature into individual vessel trees. Finally, the results of the two tasks mutually correct each other to accomplish the final A/V separation. Experimental results demonstrate that our method achieves accuracy values of 94.7%, 96.9%, and 94.5% on three major databases (DRIVE, INSPIRE, WIDE), respectively, outperforming recent state-of-the-art methods.
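
To make the two-task setup concrete, the sketch below shows a siamese head operating on per-segment feature vectors: a shared linear classifier predicts the A/V type of each segment, and a small similarity head scores whether two connected segments belong to the same vessel tree. The layer sizes and module names are illustrative assumptions, and the per-segment features are assumed to come from an upstream extractor such as the convolution-along-vessel step described above; this is not the authors' implementation.

```python
# Minimal sketch of a two-task siamese head (PyTorch). Layer sizes and names
# are illustrative; per-segment features come from an assumed upstream extractor.
import torch
import torch.nn as nn

class SiameseAVHead(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.av_classifier = nn.Linear(dim, 2)            # task i: artery vs. vein
        self.sim_head = nn.Sequential(                    # task ii: same-tree similarity
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, feat_a, feat_b):
        logits_a = self.av_classifier(feat_a)             # A/V logits for segment a
        logits_b = self.av_classifier(feat_b)             # A/V logits for segment b
        pair = torch.cat([feat_a, feat_b], dim=-1)
        sim = torch.sigmoid(self.sim_head(pair))          # probability of same tree
        return logits_a, logits_b, sim
```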


Subject(s)
Retinal Artery, Retinal Vein, Algorithms, Fundus Oculi, Retina, Retinal Vein/diagnostic imaging, Retinal Vessels/diagnostic imaging