Results 1 - 20 of 52
1.
IEEE Trans Cybern ; PP, 2024 May 21.
Article in English | MEDLINE | ID: mdl-38771679

ABSTRACT

Temporal knowledge graphs (TKGs) are receiving increased attention due to their time-dependent properties and the evolving nature of knowledge over time. TKGs typically contain complex geometric structures, such as hierarchical, ring, and chain structures, which are often mixed together. However, embedding TKGs into Euclidean space, as is typically done by TKG completion (TKGC) models, presents a challenge when dealing with high-dimensional nonlinear data and complex geometric structures. To address this issue, we propose a novel TKGC model called multicurvature adaptive embedding (MADE). MADE models TKGs in multicurvature spaces, including flat Euclidean space (zero curvature), hyperbolic space (negative curvature), and hyperspherical space (positive curvature), to handle multiple geometric structures. We assign different weights to different curvature spaces in a data-driven manner to strengthen the curvature spaces well suited to modeling and weaken the inappropriate ones. Additionally, we introduce the quadruplet distributor (QD) to assist the information interaction in each geometric space. Finally, we develop a novel temporal regularization to enhance the smoothness of timestamp embeddings by strengthening the correlation between neighboring timestamps. Experimental results show that MADE outperforms the existing state-of-the-art TKGC models.
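
As a rough, illustrative sketch of the data-driven weighting idea described above (not the paper's implementation; the distance formulas, weighting scheme, and all names are assumptions), a score that mixes per-curvature distances could look like:

```python
import torch
import torch.nn as nn

def euclidean_dist(u, v):
    # Distance in flat space (zero curvature).
    return torch.norm(u - v, dim=-1)

def poincare_dist(u, v, eps=1e-5):
    # Geodesic distance in the Poincare ball (negative curvature).
    uu = torch.clamp((u * u).sum(-1), 0.0, 1.0 - eps)
    vv = torch.clamp((v * v).sum(-1), 0.0, 1.0 - eps)
    duv = ((u - v) ** 2).sum(-1)
    x = 1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv))
    return torch.acosh(torch.clamp(x, min=1.0 + eps))

def spherical_dist(u, v, eps=1e-5):
    # Great-circle distance on the unit hypersphere (positive curvature).
    cos = (u * v).sum(-1) / (u.norm(dim=-1) * v.norm(dim=-1) + eps)
    return torch.acos(torch.clamp(cos, -1.0 + eps, 1.0 - eps))

class MultiCurvatureScore(nn.Module):
    """Weights three curvature spaces in a data-driven way."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(3))  # one learnable weight per space

    def forward(self, u, v):
        d = torch.stack([euclidean_dist(u, v),
                         poincare_dist(u, v),
                         spherical_dist(u, v)], dim=-1)
        w = torch.softmax(self.logits, dim=0)  # learned weights per curvature space
        return -(d * w).sum(-1)  # higher score = more plausible quadruplet
```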

2.
Neural Netw ; 176: 106322, 2024 Apr 16.
Article in English | MEDLINE | ID: mdl-38653128

ABSTRACT

In the realm of long document classification (LDC), previous research has predominantly focused on modeling unimodal texts, overlooking the potential of multi-modal documents incorporating images. To address this gap, we introduce an innovative approach for multi-modal long document classification based on the Hierarchical Prompt and Multi-modal Transformer (HPMT). The proposed HPMT method facilitates multi-modal interactions at both the section and sentence levels, enabling a comprehensive capture of hierarchical structural features and complex multi-modal associations of long documents. Specifically, a Multi-scale Multi-modal Transformer (MsMMT) is tailored to capture the multi-granularity correlations between sentences and images. This is achieved through the incorporation of multi-scale convolutional kernels on sentence features, enhancing the model's ability to discern intricate patterns. Furthermore, to facilitate cross-level information interaction and promote learning of specific features at different levels, we introduce a Hierarchical Prompt (HierPrompt) block. This block incorporates section-level prompts and sentence-level prompts, both derived from a global prompt via distinct projection networks. Extensive experiments are conducted on four challenging multi-modal long document datasets. The results conclusively demonstrate the superiority of our proposed method, showcasing its performance advantages over existing techniques.
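
A minimal sketch of the multi-scale convolution over sentence features that the MsMMT description suggests (kernel sizes, fusion by averaging, and all names are assumptions, not the paper's settings):

```python
import torch
import torch.nn as nn

class MultiScaleSentenceConv(nn.Module):
    """Applies convolutions of several widths over a sentence sequence."""
    def __init__(self, dim, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes)

    def forward(self, x):              # x: (batch, num_sentences, dim)
        x = x.transpose(1, 2)          # Conv1d expects (batch, dim, length)
        feats = [conv(x) for conv in self.convs]
        fused = torch.stack(feats).mean(0)  # fuse scales; the paper may differ
        return fused.transpose(1, 2)
```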

3.
Article in English | MEDLINE | ID: mdl-38446647

ABSTRACT

The objective of visual question answering (VQA) is to adequately comprehend a question and identify the relevant content in an image that can provide an answer. Existing approaches in VQA often combine visual and question features directly to create a unified cross-modality representation for answer inference. However, this kind of approach fails to bridge the semantic gap between the visual and text modalities, resulting in a lack of alignment in cross-modality semantics and an inability to match key visual content accurately. In this article, we propose a model called the caption bridge-based cross-modality alignment and contrastive learning model (CBAC) to address this issue. The CBAC model aims to reduce the semantic gap between the different modalities. It consists of a caption-based cross-modality alignment module and a visual-caption (V-C) contrastive learning module. By utilizing an auxiliary caption that shares the same modality as the question and has closer semantic associations with the visual content, we effectively reduce the semantic gap by separately matching the caption with both the question and the visual features to generate pre-alignment features for each, which are then used in the subsequent fusion process. We also leverage the fact that V-C pairs exhibit stronger semantic connections than question-visual (Q-V) pairs and employ a contrastive learning mechanism on visual and caption pairs to further enhance the semantic alignment capabilities of the single-modality encoders. Extensive experiments conducted on three benchmark datasets demonstrate that the proposed model outperforms previous state-of-the-art VQA models. Additionally, ablation experiments confirm the effectiveness of each module in our model. Furthermore, we conduct a qualitative analysis by visualizing the attention matrices to assess the reasoning reliability of the proposed model.
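
One plausible instantiation of the V-C contrastive module described above is a symmetric InfoNCE loss over matched visual-caption pairs; this is a hedged sketch, not the paper's exact loss (the temperature and in-batch negative scheme are assumptions):

```python
import torch
import torch.nn.functional as F

def vc_contrastive_loss(visual, caption, temperature=0.07):
    # visual, caption: (batch, dim) embeddings; row i of each is a matched V-C pair.
    v = F.normalize(visual, dim=-1)
    c = F.normalize(caption, dim=-1)
    logits = v @ c.t() / temperature               # pairwise cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric loss: match each visual to its caption and vice versa;
    # other in-batch pairs act as negatives.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```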

4.
Article in English | MEDLINE | ID: mdl-38300769

ABSTRACT

Attribute graphs are a crucial data structure for graph communities. However, the presence of redundancy and noise in an attribute graph can impair the aggregation effect when integrating the two heterogeneous distributions of attribute and structural features, resulting in inconsistent and distorted data that ultimately compromises the accuracy and reliability of attribute graph learning. For instance, redundant or irrelevant attributes can result in overfitting, while noisy attributes can lead to underfitting. Similarly, redundant or noisy structural features can affect the accuracy of graph representations, making it challenging to distinguish between different nodes or communities. To address these issues, we propose the embedding fusion graph auto-encoder (EFGAE) framework for self-supervised learning (SSL), which leverages multitask learning to fuse node features across different tasks to reduce redundancy. The EFGAE framework comprises two phases: pretraining (PT) and downstream task learning (DTL). During the PT phase, EFGAE uses a graph auto-encoder (GAE) based on adversarial contrastive learning to learn structural and attribute embeddings separately and then fuses these embeddings to obtain a representation of the entire graph. During the DTL phase, we introduce an adaptive graph convolutional network (AGCN), which is applied to graph neural network (GNN) classifiers to enhance recognition for downstream tasks. The experimental results demonstrate that our approach outperforms state-of-the-art (SOTA) techniques in terms of accuracy, generalization ability, and robustness.

5.
IEEE Trans Image Process ; 33: 1059-1069, 2024.
Article in English | MEDLINE | ID: mdl-38265894

ABSTRACT

This paper presents a novel fine-grained task for traffic accident analysis. Accident detection in surveillance or dashcam videos is a common task in video-based traffic accident analysis. However, common accident detection does not analyze the particulars of an accident; it only identifies the accident's existence or occurrence time in a video. In this paper, we define a novel fine-grained accident detection task that comprises fine-grained accident classification, temporal-spatial occurrence region localization, and accident severity estimation. A transformer-based framework combining the RGB and optical flow information of videos is proposed for fine-grained accident detection. Additionally, we introduce a challenging Fine-grained Accident Detection (FAD) database that covers multiple tasks in surveillance videos and places more emphasis on the overall perspective. Experimental results demonstrate that our model can effectively extract video features for multiple tasks, and that current traffic accident analysis methods have limitations in dealing with the FAD task, indicating that further research is needed.
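
A schematic of the two-stream idea (RGB plus optical flow fused by a transformer); this is a hedged sketch under assumed dimensions and a simple concatenate-then-attend fusion, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Joint self-attention over concatenated RGB and flow tokens."""
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, rgb_feats, flow_feats):
        # rgb_feats, flow_feats: (batch, time, dim) per-frame features.
        tokens = torch.cat([rgb_feats, flow_feats], dim=1)  # concat the streams
        return self.encoder(tokens)  # attention spans both modalities
```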

6.
Neural Netw ; 169: 1-10, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37852165

ABSTRACT

Graph Neural Networks (GNNs) have emerged as a crucial deep learning framework for graph-structured data. However, existing GNNs suffer from a scalability limitation, which hinders their practical implementation in industrial settings. Many scalable GNNs have been proposed to address this limitation, but they have been proven to act as low-pass graph filters, which discard the valuable middle- and high-frequency information. This paper proposes a novel graph neural network named Adaptive Filtering Graph Neural Networks (AFGNN), which can capture all frequency information on large-scale graphs. AFGNN consists of two stages. The first stage utilizes low-, middle-, and high-pass graph filters to extract comprehensive frequency information without introducing additional parameters. This computation is a one-time task and is pre-computed before training, ensuring scalability. The second stage incorporates node-level attention-based feature combination, enabling the generation of customized graph filters for each node, in contrast to existing spectral GNNs that employ uniform graph filters for the entire graph. AFGNN is suitable for mini-batch training, can enhance scalability, and efficiently captures all frequency information from large-scale graphs. We evaluate AFGNN by comparing its ability to capture all frequency information against spectral GNNs and its scalability against scalable GNNs. Experimental results illustrate that AFGNN surpasses both scalable GNNs and spectral GNNs, highlighting its superiority.
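
A sketch of the one-time pre-computation stage: applying low-, middle-, and high-pass graph filters to node features before training. The filter definitions below (built from the symmetric normalized Laplacian) are illustrative assumptions; the paper's exact filters may differ:

```python
import numpy as np
import scipy.sparse as sp

def precompute_filtered_features(adj, feats):
    # adj: (n, n) sparse adjacency matrix; feats: (n, d) node feature matrix.
    n = adj.shape[0]
    deg = np.asarray(adj.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(np.power(np.maximum(deg, 1e-12), -0.5))
    a_norm = d_inv_sqrt @ adj @ d_inv_sqrt     # D^-1/2 A D^-1/2
    lap = sp.eye(n) - a_norm                   # normalized Laplacian L
    low = a_norm @ feats                       # low-pass response: 1 - lambda
    high = lap @ feats                         # high-pass response: lambda
    mid = a_norm @ (lap @ feats)               # band-pass response: lambda * (1 - lambda)
    return low, mid, high                      # computed once, reused during training
```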


Subject(s)
Neural Networks, Computer
7.
IEEE Trans Image Process ; 32: 6129-6141, 2023.
Article in English | MEDLINE | ID: mdl-37889807

ABSTRACT

Event cameras, or dynamic vision sensors, have recently achieved success in tasks ranging from fundamental vision to high-level vision research. Due to their ability to asynchronously capture light intensity changes, event cameras have an inherent advantage in capturing moving objects in challenging scenarios, including objects under low light or high dynamic range and fast-moving objects. Event cameras are thus natural sensors for visual object tracking. However, current event-based trackers derived from RGB trackers simply change the input images to event frames and still follow the conventional tracking pipeline, which mainly focuses on object texture for target distinction. As a result, these trackers may not be robust in challenging scenarios such as moving cameras and cluttered foregrounds. In this paper, we propose a distractor-aware event-based tracker that introduces transformer modules into a Siamese network architecture (named DANet). Specifically, our model is mainly composed of a motion-aware network and a target-aware network, which simultaneously exploit both motion cues and object contours from event data, so as to discover moving objects and identify the target object by removing dynamic distractors. Our DANet can be trained in an end-to-end manner without any post-processing and can run at over 80 FPS on a single V100. We conduct comprehensive experiments on two large event tracking datasets to validate the proposed model. We demonstrate that our tracker has superior performance against state-of-the-art trackers in terms of both accuracy and efficiency.

8.
Article in English | MEDLINE | ID: mdl-37695953

ABSTRACT

Effective modal fusion and perception between the language and the image are necessary for inferring the referred instance in the referring image segmentation (RIS) task. In this article, we propose a novel RIS network, the global and local interactive perception network (GLIPN), to enhance the quality of modal fusion between the language and the image from both local and global perspectives. The core of GLIPN is the global and local interactive perception (GLIP) scheme. Specifically, the GLIP scheme contains a local perception module (LPM) and a global perception module (GPM). The LPM is designed to enhance local modal fusion through the correspondence between words and local image semantics. The GPM is designed to inject the global structured semantics of the image into the modal fusion process, which can better guide the word embeddings to perceive the image's global structure. Combined with local-global context semantics fusion, extensive experiments on several benchmark datasets demonstrate the advantage of the proposed GLIPN over most state-of-the-art approaches.

9.
Neural Netw ; 165: 1010-1020, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37467583

ABSTRACT

To learn embedding representations of graph-structured data corrupted by noise and outliers, existing graph structure learning networks usually follow a two-step paradigm, i.e., constructing a "good" graph structure and achieving message passing for signals supported on the learned graph. However, data corrupted by noise may make the learned graph structure unreliable. In this paper, we propose an adaptive graph convolutional clustering network that alternately adjusts the graph structure and node representations layer-by-layer with back-propagation. Specifically, we design a Graph Structure Learning layer before each Graph Convolutional layer to learn a sparse graph structure from the node representations, where the graph structure is implicitly determined by the solution to an optimal self-expression problem. This is one of the first works to use an optimization process as a Graph Network layer, which is clearly different from the fixed function operations in traditional deep learning layers. An efficient iterative optimization algorithm is given to solve the optimal self-expression problem in the Graph Structure Learning layer. Experimental results show that the proposed method can effectively defend against the negative effects of inaccurate graph structures. The code is available at https://github.com/HeXiax/SSGNN.
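
For concreteness, a sparse self-expression problem of the form min_C 0.5*||X - XC||_F^2 + lam*||C||_1 can be solved with a simple ISTA-style iteration; this is a hedged sketch of that generic solver (the paper's exact formulation, constraints, and unrolled layer may differ):

```python
import numpy as np

def self_expression(X, lam=0.1, iters=200):
    # X: (d, n) data matrix, one sample per column; returns coefficients C (n, n).
    n = X.shape[1]
    G = X.T @ X                                    # Gram matrix
    lr = 1.0 / (np.linalg.norm(G, 2) + 1e-12)      # step size <= 1 / Lipschitz constant
    C = np.zeros((n, n))
    for _ in range(iters):
        grad = G @ C - G                           # gradient of the smooth term
        C = C - lr * grad                          # gradient step
        C = np.sign(C) * np.maximum(np.abs(C) - lr * lam, 0.0)  # soft-thresholding
        np.fill_diagonal(C, 0.0)                   # forbid trivial self-representation
    return C
```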


Subject(s)
Algorithms, Cluster Analysis
10.
Article in English | MEDLINE | ID: mdl-37459264

ABSTRACT

Structured clustering networks, which alleviate the oversmoothing issue by delivering hidden features from an autoencoder (AE) to graph convolutional networks (GCNs), involve two shortcomings for the clustering task. First, they use a vanilla structure to learn clustering representations without considering feature and structure corruption; second, they exhibit network degradation and vanishing gradient issues after stacking multilayer GCNs. In this article, we propose a clustering method called the dual-masked deep structural clustering network (DMDSC) with adaptive bidirectional information delivery (ABID). Specifically, DMDSC enables generative self-supervised learning to mine deeper interstructure and interfeature correlations by simultaneously reconstructing corrupted structures and features. Furthermore, DMDSC develops an ABID module to establish an information transfer channel between each pairwise layer of the AE and GCNs to alleviate the oversmoothing and vanishing gradient problems. Numerous experiments on six benchmark datasets have shown that the proposed DMDSC outperforms the most advanced deep clustering algorithms.
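
A minimal sketch of the dual-masking step implied above: randomly corrupt node features and graph structure, then train the network to reconstruct the originals. Mask rates, dense tensors, and function names are illustrative assumptions:

```python
import torch

def mask_inputs(feats, adj, feat_rate=0.3, edge_rate=0.2):
    # feats: (n, d) float node features; adj: (n, n) float dense adjacency.
    feat_mask = torch.rand_like(feats) < feat_rate
    masked_feats = feats.masked_fill(feat_mask, 0.0)    # drop feature entries
    edge_mask = (torch.rand_like(adj) < edge_rate) & (adj > 0)
    masked_adj = adj.masked_fill(edge_mask, 0.0)        # drop existing edges
    return masked_feats, masked_adj                     # reconstruction targets: feats, adj
```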

11.
Article in English | MEDLINE | ID: mdl-37224351

ABSTRACT

Temporal knowledge graph completion (TKGC) is an extension of traditional static knowledge graph completion (SKGC) that introduces the timestamp. Existing TKGC methods generally translate the original quadruplet to the form of a triplet by integrating the timestamp into the entity/relation, and then use SKGC methods to infer the missing item. However, such an integration largely limits the expressive ability of temporal information and ignores the semantic loss problem caused by entities, relations, and timestamps being located in different spaces. In this article, we propose a novel TKGC method called the quadruplet distributor network (QDN), which independently models the embeddings of entities, relations, and timestamps in their specific spaces to fully capture the semantics, and builds the QD to facilitate the aggregation and distribution of information among them. Furthermore, the interaction among entities, relations, and timestamps is integrated using a novel quadruplet-specific decoder, which stretches the third-order tensor to fourth-order to satisfy the TKGC criterion. Equally important, we design a novel temporal regularization that imposes a smoothness constraint on temporal embeddings. Experimental results show that the proposed method outperforms the existing state-of-the-art TKGC methods. The source codes of this article are available at https://github.com/QDN for Temporal Knowledge Graph Completion.git.
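
The temporal regularization described above amounts to penalizing differences between embeddings of neighboring timestamps; a hedged sketch (the paper's norm, weighting, and names may differ):

```python
import torch

def temporal_smoothness(time_emb, p=2):
    # time_emb: (num_timestamps, dim), rows ordered chronologically.
    diffs = time_emb[1:] - time_emb[:-1]          # neighboring-timestamp differences
    return diffs.norm(p=p, dim=-1).pow(p).mean()  # small value = smooth trajectory
```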

12.
Sensors (Basel) ; 23(7)2023 Mar 30.
Article in English | MEDLINE | ID: mdl-37050670

ABSTRACT

Detecting salient objects in complicated scenarios is a challenging problem. In addition to the semantic features from the RGB image, spatial information from the depth image also provides sufficient cues about the object. Therefore, it is crucial to rationally integrate RGB and depth features for the RGB-D salient object detection task. Most existing RGB-D saliency detectors modulate RGB semantic features with absolute depth values. However, they ignore the appearance contrast and structure knowledge indicated by relative depth values between pixels. In this work, we propose a depth-induced network (DIN) for RGB-D salient object detection that takes full advantage of both absolute and relative depth information and, further, enforces the in-depth fusion of the RGB-D cross-modalities. Specifically, an absolute depth-induced module (ADIM) is proposed to hierarchically integrate absolute depth values and RGB features, allowing interaction between the appearance and structural information in the encoding stage. A relative depth-induced module (RDIM) is designed to capture detailed saliency cues by exploring contrastive and structural information from relative depth values in the decoding stage. By combining the ADIM and RDIM, we can accurately locate salient objects with clear boundaries, even in complex scenes. The proposed DIN is a lightweight network, and its model size is much smaller than that of state-of-the-art algorithms. Extensive experiments on six challenging benchmarks show that our method outperforms most existing RGB-D salient object detection models.

13.
IEEE Trans Pattern Anal Mach Intell ; 45(9): 11079-11095, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37018106

ABSTRACT

We present a deep reinforcement learning method of progressive view inpainting for colored semantic point cloud scene completion under volume guidance, achieving high-quality scene reconstruction from only a single RGB-D image with severe occlusion. Our approach is end-to-end, consisting of three modules: 3D scene volume reconstruction, 2D RGB-D and segmentation image inpainting, and multi-view selection for completion. Given a single RGB-D image, our method first predicts its semantic segmentation map and goes through the 3D volume branch to obtain a volumetric scene reconstruction as a guide for the next view inpainting step, which attempts to make up the missing information; the third step involves projecting the volume under the same view as the input, concatenating them to complete the current-view RGB-D image and segmentation map, and integrating all RGB-D and segmentation maps into the point cloud. Since the occluded areas are unavailable, we resort to an A3C network to glance around and pick the next best view for large hole completion progressively until the scene is adequately reconstructed while guaranteeing validity. All steps are learned jointly to achieve robust and consistent results. We perform qualitative and quantitative evaluations with extensive experiments on the 3D-FUTURE data, obtaining better results than the state of the art.

14.
IEEE Trans Vis Comput Graph ; 29(12): 5556-5568, 2023 Dec.
Article in English | MEDLINE | ID: mdl-36367917

ABSTRACT

3D scene graph generation (SGG) has attracted high interest in computer vision. Although the accuracy of 3D SGG on coarse classification and single relation labels has gradually improved, the performance of existing works is still far from perfect for fine-grained and multi-label situations. In this article, we propose a framework that fully explores contextual information for the 3D SGG task, attempting to satisfy the requirements of fine-grained entity classes, multiple relation labels, and high accuracy simultaneously. Our proposed approach is composed of a Graph Feature Extraction module and a Graph Contextual Reasoning module, achieving feature extraction with appropriate information redundancy, structured organization, and hierarchical inferring. Our approach achieves superior or competitive performance over previous methods on the 3DSSG dataset, especially on the relationship prediction sub-task.

15.
Neural Netw ; 158: 305-317, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36493533

ABSTRACT

Graph convolutional networks (GCNs) have become a popular tool for learning from unstructured graph data due to their powerful learning ability. Many researchers have been interested in fusing topological structures and node features to extract correlation information for classification tasks. However, it is inadequate to integrate the embeddings from topology and feature spaces to gain the most correlated information. At the same time, most GCN-based methods assume that the topology graph or feature graph is compatible with the properties of GCNs, but this is usually not satisfied, since meaningless, missing, or even unreal edges are very common in actual graphs. To obtain a more robust and accurate graph structure, we construct an adaptive graph from the topology and feature graphs. We propose Multi-graph Fusion Graph Convolutional Networks with pseudo-label supervision (MFGCN), which learns a connected embedding by fusing the multiple graphs and node features. We obtain the final node embedding for semi-supervised node classification by propagating node features over the multiple graphs. Furthermore, to alleviate the problem of missing labels in semi-supervised classification, a pseudo-label generation mechanism is proposed to generate more reliable pseudo-labels based on the similarity of node features. Extensive experiments on six benchmark datasets demonstrate the superiority of MFGCN over state-of-the-art classification methods.
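
An illustrative version of similarity-based pseudo-label generation: an unlabeled node adopts the label of its most similar labeled node when the cosine similarity clears a confidence threshold. The threshold and all names are assumptions, not the paper's mechanism:

```python
import torch
import torch.nn.functional as F

def pseudo_labels(feats, labels, labeled_idx, unlabeled_idx, thresh=0.9):
    # feats: (n, d); labels: (n,); labeled_idx / unlabeled_idx: index tensors.
    f = F.normalize(feats, dim=-1)
    sim = f[unlabeled_idx] @ f[labeled_idx].t()   # cosine similarity to labeled nodes
    best_sim, best = sim.max(dim=1)
    keep = best_sim > thresh                      # keep only confident matches
    return unlabeled_idx[keep], labels[labeled_idx][best[keep]]
```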


Subject(s)
Benchmarking, Intelligence, Learning
16.
IEEE Trans Neural Netw Learn Syst ; 34(10): 8071-8085, 2023 Oct.
Article in English | MEDLINE | ID: mdl-35767491

ABSTRACT

Long document classification (LDC) has recently attracted focused interest in natural language processing (NLP) with the exponential increase in publications. Based on pretrained language models, many LDC methods have been proposed and have achieved considerable progress. However, most of the existing methods model long documents as sequences of text while omitting the document structure, thus limiting their capability to effectively represent long texts carrying structural information. To mitigate this limitation, we propose a novel hierarchical graph convolutional network (HGCN) for structured LDC in this article, in which a section graph network is proposed to model the macrostructure of a document, and a word graph network with a decoupled graph convolutional block is designed to extract the fine-grained features of a document. In addition, an interaction strategy is proposed to integrate these two networks as a whole by propagating features between them. To verify the effectiveness of the proposed model, four structured long document datasets are constructed, and extensive experiments conducted on these datasets and another unstructured dataset show that the proposed method outperforms state-of-the-art related classification methods.

17.
IEEE Trans Neural Netw Learn Syst ; 34(12): 10309-10323, 2023 Dec.
Article in English | MEDLINE | ID: mdl-35442894

ABSTRACT

This article presents a new text-to-image (T2I) generation model, named distribution regularization generative adversarial network (DR-GAN), to generate images from text descriptions through improved distribution learning. In DR-GAN, we introduce two novel modules: a semantic disentangling module (SDM) and a distribution normalization module (DNM). The SDM combines a spatial self-attention mechanism (SSAM) and a new semantic disentangling loss (SDL) to help the generator distill key semantic information for image generation. The DNM uses a variational auto-encoder (VAE) to normalize and denoise the image latent distribution, which can help the discriminator better distinguish synthesized images from real images. The DNM also adopts a distribution adversarial loss (DAL) to guide the generator to align with normalized real image distributions in the latent space. Extensive experiments on two public datasets demonstrate that DR-GAN achieves competitive performance on the T2I task. The code is available at: https://github.com/Tan-H-C/DR-GAN-Distribution-Regularization-for-Text-to-Image-Generation.

18.
IEEE Trans Neural Netw Learn Syst ; 34(11): 8210-8224, 2023 Nov.
Article in English | MEDLINE | ID: mdl-35312622

ABSTRACT

This article presents a novel person reidentification model, named multihead self-attention network (MHSA-Net), to prune unimportant information and capture key local information from person images. MHSA-Net contains two main novel components: a multihead self-attention branch (MHSAB) and an attention competition mechanism (ACM). The MHSAB adaptively captures key local person information and then produces effective, diverse embeddings of an image for person matching. The ACM further helps filter out attention noise and non-key information. Through extensive ablation studies, we verify that the MHSAB and ACM both contribute to the performance improvement of MHSA-Net. Our MHSA-Net achieves competitive performance on the standard and occluded person Re-ID tasks.
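
The MHSAB centers on standard multihead self-attention over local image features; a minimal, hedged sketch of such a branch (dimensions and the residual/normalization layout are assumptions):

```python
import torch.nn as nn

class MHSABranch(nn.Module):
    """Multihead self-attention over local person-image features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):              # x: (batch, num_patches, dim)
        out, _ = self.attn(x, x, x)    # each local patch attends to all others
        return self.norm(x + out)      # residual connection, then normalization
```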

19.
IEEE Trans Neural Netw Learn Syst ; 34(10): 7196-7209, 2023 Oct.
Article in English | MEDLINE | ID: mdl-35061594

ABSTRACT

Domain adaptation in Euclidean space is a challenging task on which researchers have recently made great progress. However, in practice, there are rich data representations that are not Euclidean. For example, many high-dimensional data in computer vision are in general modeled by a low-dimensional manifold. This motivates exploring domain adaptation between non-Euclidean manifold spaces. This article is concerned with domain adaptation over the classic Grassmann manifolds. An optimal transport-based domain adaptation model on Grassmann manifolds is proposed. The model implements the adaptation between datasets by minimizing the Wasserstein distances between the projected source data and the target data on Grassmann manifolds. Four regularization terms are introduced to keep task-related consistency in the adaptation process. Furthermore, to reduce the computational cost, a simplified model preserving the necessary adaptation property, together with an efficient algorithm for it, is proposed and tested. Experiments on several publicly available datasets show that the proposed model outperforms several relevant baseline domain adaptation methods.
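
A toy sketch of the core objective: project the data through a point on the Grassmann manifold (an orthonormal basis U) and measure the Wasserstein distance between the domains, here via the POT library. The paper's model adds four regularization terms omitted here, projects the domains per its own scheme, and all names are illustrative:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def projected_wasserstein(Xs, Xt, U):
    # Xs: (ns, D) source data; Xt: (nt, D) target data; U: (D, d), U.T @ U = I.
    Zs, Zt = Xs @ U, Xt @ U               # project both domains onto the subspace
    M = ot.dist(Zs, Zt)                   # pairwise squared Euclidean costs
    a = np.full(len(Zs), 1.0 / len(Zs))   # uniform source weights
    b = np.full(len(Zt), 1.0 / len(Zt))   # uniform target weights
    return ot.emd2(a, b, M)               # exact optimal transport cost
```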

20.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3396-3410, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35648873

ABSTRACT

The low-rank tensor can characterize the inner structure and explore high-order correlations among multi-view representations, and has been widely used in multi-view clustering. Existing approaches adopt the tensor nuclear norm (TNN) as a convex approximation of the non-convex tensor rank function. However, TNN treats the different singular values equally and over-penalizes the main rank components, leading to sub-optimal tensor representation. In this paper, we devise a better surrogate of tensor rank, namely the tensor logarithmic Schatten-p norm (TLS_pN), which fully considers the physical difference between singular values through a non-convex and non-linear penalty function. Further, a tensor logarithmic Schatten-p norm minimization (TLS_pNM)-based multi-view subspace clustering (TLS_pNM-MSC) model is proposed. Specifically, the proposed TLS_pN can not only protect the larger singular values encoded with useful structural information, but also remove the smaller ones encoded with redundant information. Thus, the learned tensor representation with a compact low-rank structure will well explore the complementary information and accurately characterize the high-order correlation among multi-views. The alternating direction method of multipliers (ADMM) is used to solve the non-convex multi-block TLS_pNM-MSC model, where the challenging TLS_pNM problem is carefully handled. Importantly, the algorithm convergence analysis is mathematically established by showing that the sequence generated by the algorithm is Cauchy and converges to a Karush-Kuhn-Tucker (KKT) point. Experimental results on nine benchmark databases reveal the superiority of the TLS_pNM-MSC model.
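
For reference, a commonly used logarithmic Schatten-p surrogate (the paper's exact definition and normalization may differ) and its derivative are

$$\|\mathcal{X}\|_{\mathrm{TLS}_p} \;=\; \sum_{i}\log\!\left(1+\sigma_i^{\,p}\right),\qquad \frac{\partial}{\partial\sigma_i}\,\log\!\left(1+\sigma_i^{\,p}\right) \;=\; \frac{p\,\sigma_i^{\,p-1}}{1+\sigma_i^{\,p}},$$

where the $\sigma_i$ are the (t-SVD) singular values and $0 < p \le 1$. The derivative shrinks as $\sigma_i$ grows, which is what protects the dominant rank components, while the small, redundancy-carrying singular values incur a proportionally heavier penalty.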
