Results 1 - 17 of 17
1.
Article in English | MEDLINE | ID: mdl-39159040

ABSTRACT

Many machine learning problems can be formulated as non-convex multi-player games. Due to non-convexity, it is challenging to obtain the existence condition of the global Nash equilibrium (NE) and to design theoretically guaranteed algorithms. This paper studies a class of non-convex multi-player games, where players' payoff functions consist of canonical functions and quadratic operators. We leverage conjugate properties to transform the complementarity problem into a variational inequality (VI) problem using a continuous pseudo-gradient mapping. We prove that the global NE exists when the solution to the VI problem satisfies a duality relation. We then design an ordinary differential equation that approaches the global NE with an exponential convergence rate. For practical implementation, we derive a discretized algorithm and apply it to two scenarios: multi-player games with generalized monotonicity and multi-player potential games. In the two settings, step sizes of O(1/k) and O(1/√k) are required to yield convergence rates of O(1/k) and O(1/√k), respectively. Extensive experiments on robust neural network training and sensor network localization validate our theory. Our code is available at https://github.com/GuanpuChen/Global-NE.
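As a minimal illustration of the discretized update described above, the following numpy sketch runs pseudo-gradient descent with an O(1/k) diminishing step size on a toy two-player quadratic game; the game and all names are illustrative and do not reproduce the paper's canonical-function/quadratic-operator setting.

import numpy as np

# Toy strongly monotone two-player game: f1(x1, x2) = x1^2 + x1*x2,
# f2(x1, x2) = x2^2 - x1*x2. The pseudo-gradient stacks each player's
# partial derivative with respect to its own decision variable.
def pseudo_gradient(x):
    x1, x2 = x
    return np.array([2 * x1 + x2, 2 * x2 - x1])

x = np.array([3.0, -2.0])
for k in range(1, 2001):
    step = 1.0 / k                      # O(1/k) diminishing step size
    x = x - step * pseudo_gradient(x)

print("approximate Nash equilibrium:", x)   # approaches the unique NE at (0, 0)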

2.
Neural Netw ; 179: 106539, 2024 Jul 17.
Article in English | MEDLINE | ID: mdl-39089149

ABSTRACT

Significant progress has been achieved in multi-object tracking (MOT) through the evolution of detection and re-identification (ReID) techniques. Despite these advancements, accurately tracking objects in scenarios with homogeneous appearance and heterogeneous motion remains a challenge. This challenge arises from two main factors: the insufficient discriminability of ReID features and the predominant use of linear motion models in MOT. In this context, we introduce a novel motion-based tracker, MotionTrack, centered around a learnable motion predictor that relies solely on object trajectory information. This predictor comprehensively integrates two levels of granularity in motion features to enhance the modeling of temporal dynamics and facilitate precise future motion prediction for individual objects. Specifically, the proposed approach adopts a self-attention mechanism to capture token-level information and a Dynamic MLP layer to model channel-level features. MotionTrack is a simple, online tracking approach. Our experimental results demonstrate that MotionTrack yields state-of-the-art performance on datasets such as DanceTrack and SportsMOT, which are characterized by highly complex object motion.
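A hedged PyTorch sketch of the two-granularity idea (token-level self-attention plus a channel-wise MLP over a trajectory of past boxes) is given below; the layer sizes and the plain MLP standing in for the Dynamic MLP are assumptions, not MotionTrack's released implementation.

import torch
import torch.nn as nn

class TinyMotionPredictor(nn.Module):
    """Sketch: self-attention over trajectory tokens plus a channel MLP,
    predicting the next motion offset from past (cx, cy, w, h) boxes."""
    def __init__(self, box_dim=4, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(box_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.channel_mlp = nn.Sequential(        # stand-in for the Dynamic MLP layer
            nn.Linear(d_model, d_model * 2), nn.GELU(), nn.Linear(d_model * 2, d_model))
        self.head = nn.Linear(d_model, box_dim)

    def forward(self, traj):                     # traj: (B, T, 4) past boxes
        x = self.embed(traj)
        x = x + self.attn(x, x, x)[0]            # token-level interactions
        x = x + self.channel_mlp(x)              # channel-level modeling
        return self.head(x[:, -1])               # offset for the next frame

offset = TinyMotionPredictor()(torch.randn(2, 10, 4))   # -> shape (2, 4)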

3.
Article in English | MEDLINE | ID: mdl-38829763

ABSTRACT

Transformers, originally devised for natural language processing (NLP), have also produced significant successes in computer vision (CV). Due to their strong expression power, researchers are investigating ways to deploy transformers for reinforcement learning (RL), and transformer-based models have manifested their potential in representative RL benchmarks. In this paper, we collect and dissect recent advances concerning the transformation of RL with transformers (transformer-based RL (TRL)) to explore the development trajectory and future trends of this field. We group the existing developments into two categories: architecture enhancements and trajectory optimizations, and examine the main applications of TRL in robotic manipulation, text-based games (TBGs), navigation, and autonomous driving. Architecture enhancement methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework, facilitating more precise modeling of agents and environments compared to traditional deep RL techniques. However, these methods are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and the "deadly triad". Trajectory optimization methods treat RL problems as sequence modeling problems and train a joint state-action model over entire trajectories under the behavior cloning framework; such approaches are able to extract policies from static datasets and fully use the long-sequence modeling capabilities of transformers. Given these advancements, the limitations and challenges in TRL are reviewed and proposals regarding future research directions are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
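To make the trajectory-optimization view concrete, here is a hedged PyTorch sketch that treats an offline trajectory as a token sequence and trains a causal transformer with a behavior-cloning loss; it mirrors the general recipe only, with illustrative sizes and no return conditioning.

import torch
import torch.nn as nn

class TrajectoryModel(nn.Module):
    """Sketch: embed states as tokens and let a causal transformer predict the
    action taken at each step (behavior cloning over offline trajectories)."""
    def __init__(self, state_dim=8, act_dim=3, d_model=64):
        super().__init__()
        self.state_embed = nn.Linear(state_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, act_dim)

    def forward(self, states):                   # states: (B, T, state_dim)
        T = states.size(1)
        causal_mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        h = self.encoder(self.state_embed(states), mask=causal_mask)
        return self.action_head(h)               # action logits per step

model = TrajectoryModel()
states, actions = torch.randn(4, 20, 8), torch.randint(0, 3, (4, 20))
loss = nn.CrossEntropyLoss()(model(states).flatten(0, 1), actions.flatten())
loss.backward()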

4.
Article in English | MEDLINE | ID: mdl-38885108

ABSTRACT

Deep supervised learning algorithms typically require a large volume of labeled data to achieve satisfactory performance. However, the process of collecting and labeling such data can be expensive and time-consuming. Self-supervised learning (SSL), a subset of unsupervised learning, aims to learn discriminative features from unlabeled data without relying on human-annotated labels. SSL has garnered significant attention recently, leading to the development of numerous related algorithms. However, there is a dearth of comprehensive studies that elucidate the connections and evolution of different SSL variants. This paper presents a review of diverse SSL methods, encompassing algorithmic aspects, application domains, three key trends, and open research questions. Firstly, we provide a detailed introduction to the motivations behind most SSL algorithms and compare their commonalities and differences. Secondly, we explore representative applications of SSL in domains such as image processing, computer vision, and natural language processing. Lastly, we discuss the three primary trends observed in SSL research and highlight the open questions that remain. A curated collection of valuable resources can be accessed at https://github.com/guijiejie/SSL.
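As one concrete instance of the contrastive family surveyed here, the following is a minimal InfoNCE-style loss between two augmented views of the same batch; it is a generic sketch and not code from the linked repository.

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Contrastive SSL loss: the two views of the same sample are positives,
    every other sample in the batch serves as a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # (B, B) cosine similarities
    targets = torch.arange(z1.size(0))           # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))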

5.
Article in English | MEDLINE | ID: mdl-38743541

ABSTRACT

Federated learning (FL) aims to collaboratively learn a model from the data of multiple users under privacy constraints. In this article, we study the multilabel classification (MLC) problem under the FL setting, where trivial solutions and extremely poor performance may be obtained, especially when only positive data with respect to a single class label are provided for each client. This issue can be addressed by adding a specially designed regularizer on the server side. Although sometimes effective, such a regularizer simply ignores the label correlations and may therefore yield suboptimal performance. Besides, it is expensive and unsafe to exchange users' private embeddings between the server and clients frequently, especially when training the model in a contrastive way. To remedy these drawbacks, we propose a novel and generic federated averaging (FedAvg)-based method that explores label correlations, termed FedALC. Specifically, FedALC estimates the label correlations of different label pairs in the class embedding learning and utilizes them to improve the model training. To further improve safety and reduce the communication overhead, we propose a variant that learns a fixed class embedding for each client, so that the server and clients only need to exchange class embeddings once. Extensive experiments on multiple popular datasets demonstrate that FedALC significantly outperforms existing counterparts.
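For context, the FedAvg backbone that FedALC builds on can be summarized in a few lines; the sketch below shows only one generic averaging round, with a hypothetical local-training helper, and omits the class-embedding exchange and label-correlation estimation that are specific to FedALC.

import copy
import torch

def fedavg_round(global_model, client_loaders, local_train_fn):
    """One FedAvg round (sketch): each client updates a local copy of the model,
    and the server averages the resulting weights. `local_train_fn` is a
    hypothetical helper that runs local training and returns a state_dict."""
    client_states = []
    for loader in client_loaders:
        local_model = copy.deepcopy(global_model)
        client_states.append(local_train_fn(local_model, loader))
    avg_state = {key: torch.stack([s[key].float() for s in client_states]).mean(0)
                 for key in client_states[0]}
    global_model.load_state_dict(avg_state)
    return global_model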

6.
Nat Commun ; 15(1): 3716, 2024 May 02.
Article in English | MEDLINE | ID: mdl-38697959

ABSTRACT

Entanglement serves as the resource to empower quantum computing. Recent progress has highlighted its positive impact on learning quantum dynamics, wherein the integration of entanglement into quantum operations or measurements of quantum machine learning (QML) models leads to substantial reductions in training data size, surpassing a specified prediction error threshold. However, an analytical understanding of how the entanglement degree in data affects model performance remains elusive. In this study, we address this knowledge gap by establishing a quantum no-free-lunch (NFL) theorem for learning quantum dynamics using entangled data. Contrary to previous findings, we prove that the impact of entangled data on prediction error exhibits a dual effect, depending on the number of permitted measurements. With a sufficient number of measurements, increasing the entanglement of training data consistently reduces the prediction error or decreases the required size of the training data to achieve the same prediction error. Conversely, when few measurements are allowed, employing highly entangled data could lead to an increased prediction error. The achieved results provide critical guidance for designing advanced QML protocols, especially for those tailored for execution on early-stage quantum computers with limited access to quantum resources.

7.
IEEE Trans Image Process ; 33: 2714-2729, 2024.
Article in English | MEDLINE | ID: mdl-38557629

ABSTRACT

Billions of people share images from their daily lives on social media every day. However, their biometric information (e.g., fingerprints) could easily be stolen from these images. Because fingerprints act as a lifelong individual biometric password, the threat of fingerprint leakage from social media has created a strong desire to anonymize shared images while maintaining image quality. To guard against fingerprint leakage, adversarial attacks that add imperceptible perturbations to fingerprint images have emerged as a feasible solution. However, existing works of this kind either have weak black-box transferability or cause the images to look unnatural. Motivated by the visual perception hierarchy (i.e., high-level perception exploits model-shared semantics that transfer well across models, while low-level perception extracts primitive stimuli that result in high visual sensitivity when a suspicious stimulus is provided), we propose FingerSafe, a hierarchical perceptual protective noise injection framework, to address the above-mentioned problems. For black-box transferability, we inject protective noise into the fingerprint orientation field to perturb the model-shared high-level semantics (i.e., fingerprint ridges). Considering visual naturalness, we suppress the low-level local contrast stimulus by regularizing the response of the Lateral Geniculate Nucleus. Our proposed FingerSafe is the first to provide feasible fingerprint protection in both digital (up to 94.12%) and realistic scenarios (Twitter and Facebook, up to 68.75%). Our code can be found at https://github.com/nlsde-safety-team/FingerSafe.
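FingerSafe's orientation-field and LGN-based objectives are not reproduced here; the sketch below only illustrates the general adversarial-perturbation mechanism the abstract refers to, using a generic PGD-style loop that pushes a matcher's embedding away from the original under an imperceptible L-infinity budget.

import torch
import torch.nn.functional as F

def protective_perturbation(model, image, steps=10, eps=4 / 255, alpha=1 / 255):
    """Generic PGD-style sketch (not FingerSafe's actual losses): minimize the
    cosine similarity between the protected and original embeddings while
    keeping the perturbation within an L_inf budget of eps."""
    clean_emb = model(image).detach()
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = F.cosine_similarity(model(image + delta), clean_emb, dim=1).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # descend on similarity
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (image + delta).detach().clamp(0, 1)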


Subject(s)
Social Media, Humans, Dermatoglyphics, Privacy, Visual Perception
8.
IEEE Trans Image Process ; 33: 1782-1794, 2024.
Article in English | MEDLINE | ID: mdl-38442064

ABSTRACT

Referring Image Segmentation (RIS) is a fundamental vision-language task that outputs object masks based on text descriptions. Many works have achieved considerable progress on RIS, including different fusion method designs. In this work, we explore an essential question: "What if the text description is wrong or misleading?" For example, the described objects may not be present in the image. We term such a sentence a negative sentence. However, existing solutions for RIS cannot handle such a setting. To this end, we propose a new formulation of RIS, named Robust Referring Image Segmentation (R-RIS), which considers negative sentence inputs in addition to the regular positive text inputs. To facilitate this new task, we create three R-RIS datasets by augmenting existing RIS datasets with negative sentences and propose new metrics to evaluate both types of inputs in a unified manner. Furthermore, we propose a new transformer-based model, called RefSegformer, with a token-based vision and language fusion module. Our design can easily be extended to the R-RIS setting by adding extra blank tokens. The proposed RefSegformer achieves state-of-the-art results on both RIS and R-RIS datasets, establishing a solid baseline for both settings. Our project page is at https://github.com/jianzongwu/robust-ref-seg.
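A hedged sketch of the blank-token idea follows: learnable "blank" tokens are appended to the language tokens before cross-attention so the model has somewhere to attend when the description matches nothing, and a score pooled from the fused features can flag negative sentences. This is an interpretation of the abstract, not RefSegformer's released code.

import torch
import torch.nn as nn

class BlankTokenFusion(nn.Module):
    """Sketch: vision tokens cross-attend to [text tokens; learnable blank tokens];
    a pooled head predicts whether the sentence refers to nothing in the image."""
    def __init__(self, d_model=256, num_blank=4, n_heads=8):
        super().__init__()
        self.blank = nn.Parameter(torch.randn(1, num_blank, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.negative_head = nn.Linear(d_model, 1)

    def forward(self, vis_tokens, txt_tokens):   # (B, Nv, D), (B, Nt, D)
        blank = self.blank.expand(vis_tokens.size(0), -1, -1)
        memory = torch.cat([txt_tokens, blank], dim=1)
        fused, _ = self.cross_attn(vis_tokens, memory, memory)
        negative_logit = self.negative_head(fused.mean(dim=1))  # "no target" score
        return fused, negative_logit

fused, neg = BlankTokenFusion()(torch.randn(2, 100, 256), torch.randn(2, 12, 256))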

9.
Article in English | MEDLINE | ID: mdl-38324433

ABSTRACT

This article studies the generalization of neural networks (NNs) by examining how a network changes when trained on a training sample with or without out-of-distribution (OoD) examples. If the network's predictions are less influenced by fitting OoD examples, then the network learns attentively from the clean training set. A new notion, dataset-distraction stability, is proposed to measure this influence. Extensive experiments on CIFAR-10/100 across VGG, ResNet, WideResNet, and ViT architectures and different optimizers show a negative correlation between dataset-distraction stability and generalizability. Using distraction stability, we decompose the learning process on the training set S into multiple learning processes on subsets of S drawn from simpler distributions, i.e., distributions of smaller intrinsic dimension (ID), and furthermore derive a tighter generalization bound. Through attentive learning, the remarkable generalization of deep learning can be explained and novel algorithms can also be designed.
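The formal definition of dataset-distraction stability is given in the paper; the sketch below only illustrates the underlying measurement, i.e., how much a model's predictions on clean inputs move when OoD examples are added to its training set. The `train_fn` helper is hypothetical.

import torch

def distraction_gap(train_fn, clean_set, ood_set, eval_inputs):
    """Illustrative only (not the paper's formal definition): compare predictions
    of a model trained on the clean set with one trained on clean + OoD data."""
    model_clean = train_fn(clean_set)            # hypothetical training helper
    model_mixed = train_fn(clean_set + ood_set)
    with torch.no_grad():
        p_clean = torch.softmax(model_clean(eval_inputs), dim=1)
        p_mixed = torch.softmax(model_mixed(eval_inputs), dim=1)
    return (p_clean - p_mixed).abs().sum(dim=1).mean()   # average prediction shift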

10.
Article in English | MEDLINE | ID: mdl-38393837

ABSTRACT

With recent success of deep learning in 2-D visual recognition, deep-learning-based 3-D point cloud analysis has received increasing attention from the community, especially due to the rapid development of autonomous driving technologies. However, most existing methods directly learn point features in the spatial domain, leaving the local structures in the spectral domain poorly investigated. In this article, we introduce a new method, PointWavelet, to explore local graphs in the spectral domain via a learnable graph wavelet transform. Specifically, we first introduce the graph wavelet transform to form multiscale spectral graph convolution to learn effective local structural representations. To avoid the time-consuming spectral decomposition, we then devise a learnable graph wavelet transform, which significantly accelerates the overall training process. Extensive experiments on four popular point cloud datasets, ModelNet40, ScanObjectNN, ShapeNet-Part, and S3DIS, demonstrate the effectiveness of the proposed method on point cloud classification and segmentation.
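The standard trick that lets spectral graph filters avoid an explicit eigendecomposition is to approximate the filter with a polynomial of the (scaled) Laplacian; the numpy sketch below shows a Chebyshev-polynomial filter applied to node features. PointWavelet's learnable wavelet transform is more elaborate, so this is only the generic building block, not the proposed method.

import numpy as np

def chebyshev_filter(adj, features, coeffs):
    """Apply g(L) x ≈ sum_k c_k T_k(L_scaled) x with matrix products only,
    avoiding the eigendecomposition of the graph Laplacian."""
    n = adj.shape[0]
    deg = np.maximum(adj.sum(axis=1), 1e-12)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    laplacian = np.eye(n) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian
    l_scaled = laplacian - np.eye(n)                        # eigenvalues roughly in [-1, 1]
    t_prev, t_curr = features, l_scaled @ features          # T_0 x and T_1 x
    out = coeffs[0] * t_prev + coeffs[1] * t_curr
    for c in coeffs[2:]:
        t_prev, t_curr = t_curr, 2 * l_scaled @ t_curr - t_prev   # Chebyshev recurrence
        out = out + c * t_curr
    return out

adj = (np.random.rand(50, 50) > 0.8).astype(float)
adj = np.maximum(adj, adj.T)                                # symmetrize the toy graph
out = chebyshev_filter(adj, np.random.randn(50, 16), coeffs=[0.5, 0.3, 0.2])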

11.
IEEE Trans Pattern Anal Mach Intell ; 46(7): 5092-5113, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38315601

ABSTRACT

In the field of visual scene understanding, deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection. However, most approaches operate on the close-set assumption, meaning that the model can only identify pre-defined categories that are present in the training set. Recently, open vocabulary settings were proposed due to the rapid progress of vision language pre-training. These new approaches seek to locate and recognize categories beyond the annotated label space. The open vocabulary approach is more general, practical, and effective than weakly supervised and zero-shot settings. This paper thoroughly reviews open vocabulary learning, summarizing and analyzing recent developments in the field. In particular, we begin by juxtaposing open vocabulary learning with analogous concepts such as zero-shot learning, open-set recognition, and out-of-distribution detection. Subsequently, we examine several pertinent tasks within the realms of segmentation and detection, encompassing long-tail problems, few-shot, and zero-shot settings. As a foundation for our method survey, we first elucidate the fundamental principles of detection and segmentation in close-set scenarios. Next, we examine various contexts where open vocabulary learning is employed, pinpointing recurring design elements and central themes. This is followed by a comparative analysis of recent detection and segmentation methodologies in commonly used datasets and benchmarks. Our review culminates with a synthesis of insights, challenges, and discourse on prospective research trajectories. To our knowledge, this constitutes the inaugural exhaustive literature review on open vocabulary learning.
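The core mechanic shared by many of the open-vocabulary methods surveyed here is to replace a fixed classifier with text embeddings of arbitrary category names; a minimal cosine-similarity sketch is shown below, where `text_encoder` is a hypothetical callable (e.g., a CLIP-style text tower) and not a specific model from this survey.

import torch
import torch.nn.functional as F

def open_vocab_classify(region_feats, class_names, text_encoder, temperature=0.01):
    """Sketch: score region features against text embeddings of arbitrary class
    names, so the label space is defined at inference time rather than in training."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(text_encoder(prompts), dim=1)     # (C, D)
    region_emb = F.normalize(region_feats, dim=1)            # (N, D)
    logits = region_emb @ text_emb.t() / temperature         # (N, C)
    return logits.argmax(dim=1)                              # predicted class per region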

12.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 3608-3624, 2024 May.
Article in English | MEDLINE | ID: mdl-38190690

ABSTRACT

Window-based attention has become a popular choice in vision transformers due to its superior performance, lower computational complexity, and smaller memory footprint. However, the design of hand-crafted windows, which is data-agnostic, constrains the flexibility of transformers to adapt to objects of varying sizes, shapes, and orientations. To address this issue, we propose a novel quadrangle attention (QA) method that extends window-based attention to a general quadrangle formulation. Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles for token sampling and attention calculation, enabling the network to model various targets with different shapes and orientations and capture rich context information. We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and negligible extra computational cost. Extensive experiments on public benchmarks demonstrate that QFormer outperforms existing representative vision transformers on various vision tasks, including classification, object detection, semantic segmentation, and pose estimation. The code will be made publicly available at QFormer.
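The token-sampling mechanics can be sketched with a simpler per-window learnable affine transform (a restricted special case of the quadrangle formulation); the sizes are illustrative and this is not the paper's quadrangle regression module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableWindowSampling(nn.Module):
    """Sketch: predict a per-window affine transform from pooled window features
    and resample the window tokens from the transformed region."""
    def __init__(self, channels=96):
        super().__init__()
        self.to_theta = nn.Linear(channels, 6)
        nn.init.zeros_(self.to_theta.weight)     # start from the identity transform,
        self.to_theta.bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])  # i.e., plain windows

    def forward(self, windows):                  # windows: (N, C, H, W)
        pooled = windows.mean(dim=(2, 3))
        theta = self.to_theta(pooled).view(-1, 2, 3)
        grid = F.affine_grid(theta, windows.shape, align_corners=False)
        return F.grid_sample(windows, grid, align_corners=False)

out = LearnableWindowSampling()(torch.randn(8, 96, 7, 7))   # resampled window tokens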

13.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 3910-3922, 2024 May.
Article in English | MEDLINE | ID: mdl-38241113

ABSTRACT

Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks. However, modeling global correlations with multi-head self-attention (MSA) layers leads to two widely recognized issues: the massive computational resource consumption and the lack of intrinsic inductive bias for modeling local visual patterns. To solve both issues, we devise a simple yet effective method named Single-Path Vision Transformer pruning (SPViT), to efficiently and automatically compress the pre-trained ViTs into compact models with proper locality added. Specifically, we first propose a novel weight-sharing scheme between MSA and convolutional operations, delivering a single-path space to encode all candidate operations. In this way, we cast the operation search problem as finding which subset of parameters to use in each MSA layer, which significantly reduces the computational cost and optimization difficulty, and the convolution kernels can be well initialized using pre-trained MSA parameters. Relying on the single-path space, we introduce learnable binary gates to encode the operation choices in MSA layers. Similarly, we further employ learnable gates to encode the fine-grained MLP expansion ratios of FFN layers. In this way, our SPViT optimizes the learnable gates to automatically explore from a vast and unified search space and flexibly adjust the MSA-FFN pruning proportions for each individual dense model. We conduct extensive experiments on two representative ViTs showing that our SPViT achieves a new SOTA for pruning on ImageNet-1k. For example, our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
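The learnable-binary-gate mechanic can be shown in isolation: a straight-through hard gate that chooses between two candidate operations while remaining differentiable. The weight-sharing scheme that initializes the convolution from MSA parameters is not reproduced, and the example paths are placeholders.

import torch
import torch.nn as nn

class BinaryGate(nn.Module):
    """Sketch: a learnable gate that makes a hard 0/1 choice between two candidate
    operations in the forward pass, with a straight-through gradient estimator."""
    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))

    def forward(self, out_attn, out_conv):
        prob = torch.sigmoid(self.logit)
        hard = (prob > 0.5).float()
        gate = hard + prob - prob.detach()       # straight-through estimator
        return gate * out_attn + (1 - gate) * out_conv

gate = BinaryGate()
mixed = gate(torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14))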

14.
IEEE Trans Pattern Anal Mach Intell ; 46(6): 4443-4459, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38227418

ABSTRACT

Factorization machines (FMs) are widely used in recommender systems due to their adaptability and ability to learn from sparse data. However, for the ubiquitous non-interactive features in sparse data, existing FMs can only estimate the parameters corresponding to these features via the inner product of their embeddings; they cannot learn the direct interactions of these features, which limits the model's expressive power. To this end, we first present MixFM, inspired by Mixup, to generate auxiliary training data to boost FMs. Unlike existing augmentation strategies that require labor and expertise to collect additional information such as position and fields, these augmented data are generated simply as convex combinations of the raw samples, without any expert knowledge. More importantly, if the parent samples to be mixed each contain non-interactive features, MixFM will establish their direct interactions. Second, considering that MixFM may generate redundant or even detrimental instances, we further put forward a novel Factorization Machine powered by Saliency-guided Mixup (denoted SMFM). Guided by the customized saliency, SMFM can generate more informative neighbor data. Through theoretical analysis, we prove that the proposed methods minimize the upper bound of the generalization error, which positively enhances FMs. Finally, extensive experiments on seven datasets confirm that our approaches are superior to the baselines. Notably, the results also show that "poisoning" mixed data benefits the FM variants.
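A minimal numpy sketch of a second-order factorization machine plus the vanilla Mixup-style convex combination of two parent samples is given below; the saliency guidance of SMFM is omitted and the names are illustrative.

import numpy as np

def fm_predict(x, w0, w, v):
    """Second-order FM: w0 + <w, x> + 0.5 * sum((xV)^2 - (x^2)(V^2))."""
    linear = w0 + x @ w
    pairwise = 0.5 * np.sum((x @ v) ** 2 - (x ** 2) @ (v ** 2), axis=1)
    return linear + pairwise

def mixfm_augment(x_a, y_a, x_b, y_b, alpha=0.5):
    """Mixup-style augmentation: a convex combination of two parent samples."""
    lam = np.random.beta(alpha, alpha)
    return lam * x_a + (1 - lam) * x_b, lam * y_a + (1 - lam) * y_b

rng = np.random.default_rng(0)
x, y = rng.random((2, 10)), np.array([1.0, 0.0])
x_mix, y_mix = mixfm_augment(x[0], y[0], x[1], y[1])
pred = fm_predict(x_mix[None], w0=0.0, w=rng.normal(size=10), v=rng.normal(size=(10, 4)))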

15.
IEEE Trans Pattern Anal Mach Intell ; 46(7): 4850-4865, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38261483

ABSTRACT

Although stereo image restoration has been extensively studied, most existing work focuses on restoring stereo images with limited horizontal parallax due to the binocular symmetry constraint. Stereo images with unlimited parallax (e.g., large ranges and asymmetrical types) are more challenging in real-world applications and have rarely been explored so far. To restore high-quality stereo images with unlimited parallax, this paper proposes an attention-guided correspondence learning method, which learns both self- and cross-view feature correspondence guided by parallax and omnidirectional attention. To learn cross-view feature correspondence, a Selective Parallax Attention Module (SPAM) is proposed to interact with cross-view features under the guidance of parallax attention that adaptively selects receptive fields for different parallax ranges. Furthermore, to handle asymmetrical parallax, we propose a Non-local Omnidirectional Attention Module (NOAM) to learn the non-local correlation of both self- and cross-view contexts, which guides the aggregation of global contextual features. Finally, we propose an Attention-guided Correspondence Learning Restoration Network (ACLRNet) built upon SPAMs and NOAMs to restore stereo images by associating the features of the two views based on the learned correspondence. Extensive experiments on five benchmark datasets demonstrate the effectiveness and generalization of the proposed method on three stereo image restoration tasks: super-resolution, denoising, and compression artifact reduction.
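The basic cross-view attention that parallax-attention modules build on can be sketched as attention along the horizontal (epipolar) dimension of rectified stereo pairs; SPAM's adaptive receptive-field selection and NOAM's omnidirectional attention are not reproduced here.

import torch
import torch.nn as nn

class HorizontalCrossAttention(nn.Module):
    """Sketch: each left-view pixel attends to all right-view pixels on the same
    row, i.e., along the epipolar line of a rectified stereo pair."""
    def __init__(self, channels=64):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, left, right):                      # (B, C, H, W) each
        q, k, v = self.q(left), self.k(right), self.v(right)
        scores = torch.einsum('bchw,bchv->bhwv', q, k)   # per-row W x W similarities
        attn = scores.softmax(dim=-1)
        return torch.einsum('bhwv,bchv->bchw', attn, v)  # right features aligned to left

out = HorizontalCrossAttention()(torch.randn(1, 64, 32, 48), torch.randn(1, 64, 32, 48))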

16.
IEEE Trans Pattern Anal Mach Intell ; 46(7): 4551-4566, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38133979

ABSTRACT

Information Bottleneck (IB) provides an information-theoretic principle for multi-view learning by revealing the various components contained in each viewpoint. This highlights the necessity of capturing their distinct roles to achieve view-invariant and predictive representations, which remains under-explored due to the technical intractability of modeling and organizing innumerable mutual information (MI) terms. Recent studies show that sufficiency and consistency play such key roles in multi-view representation learning and could be preserved via a variational distillation framework. But when this strategy is generalized to arbitrary viewpoints, it fails because the mutual information terms of consistency become complicated. This paper presents Multi-View Variational Distillation (MV2D), tackling the above limitations for generalized multi-view learning. Uniquely, MV2D can recognize useful consistent information and prioritize diverse components by their generalization ability. This guides an analytical and scalable solution to achieving both sufficiency and consistency. Additionally, by rigorously reformulating the IB objective, MV2D tackles the difficulties in MI optimization and fully realizes the theoretical advantages of the information bottleneck principle. We extensively evaluate our model on diverse tasks to verify its effectiveness, where the considerable gains provide key insights into achieving generalized multi-view representations under a rigorous information-theoretic principle.
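For reference, the classical IB objective the abstract refers to is the following Lagrangian (standard textbook form, not MV2D's reformulation), where Z is the learned representation of input X, Y is the target, I denotes mutual information, and beta trades compression against predictiveness:

\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)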

17.
IEEE Trans Cybern ; 54(7): 4138-4149, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38150342

ABSTRACT

Commonsense reasoning based on knowledge graphs (KGs) is a challenging task that requires answering complex questions over the described textual contexts and relevant knowledge about the world. However, current methods typically assume clean training scenarios with accurately labeled samples, which are often unrealistic. The training set can include mislabeled samples, and robustness to label noise is essential for commonsense reasoning methods to be practical, but this problem remains largely unexplored. This work focuses on commonsense reasoning with mislabeled training samples and makes several technical contributions: 1) we first construct diverse augmentations from knowledge and the model, and offer a simple yet effective multiple-choice alignment method to divide the training samples into clean, semi-clean, and unclean parts; 2) we design adaptive label correction methods for the semi-clean and unclean samples to exploit the supervised potential of noisy information; and 3) finally, we extensively test these methods on noisy versions of commonsense reasoning benchmarks (CommonsenseQA and OpenbookQA). Experimental results show that the proposed method can significantly enhance robustness and improve overall performance. Furthermore, the proposed method is generally applicable to multiple existing commonsense reasoning frameworks to boost their robustness. The code is available at https://github.com/xdxuyang/CR_Noisy_Labels.
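The multiple-choice alignment itself is not reproduced here; the sketch below only illustrates the generic partitioning step, thresholding a per-sample noise score (e.g., disagreement between augmented predictions or per-sample loss) to split the training set into clean, semi-clean, and unclean parts. Thresholds are illustrative.

import numpy as np

def split_by_noise_score(scores, low=0.2, high=0.6):
    """Generic sketch (not the paper's alignment method): partition sample indices
    into clean / semi-clean / unclean groups by a per-sample noise score."""
    scores = np.asarray(scores)
    clean = np.where(scores < low)[0]
    semi_clean = np.where((scores >= low) & (scores < high))[0]
    unclean = np.where(scores >= high)[0]
    return clean, semi_clean, unclean

clean, semi_clean, unclean = split_by_noise_score(np.random.rand(1000))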
