Results 1 - 15 of 15
1.
IEEE Trans Pattern Anal Mach Intell ; 46(7): 5192-5208, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38354071

ABSTRACT

Previous work on video captioning aims to objectively describe the video content, but the captions lack human interest and attractiveness, limiting their practical applicability. The intention of video title generation (video titling) is to produce attractive titles, but the field lacks benchmarks. This work offers CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title gEneration dataset, to assist research and applications in video titling, video captioning, and video retrieval in Chinese. CREATE comprises a high-quality labeled 210K dataset and two web-scale 3M and 10M pre-training datasets, covering 51 categories, 50K+ tags, 537K+ manually annotated titles and captions, and 10M+ short videos with original video information. This work presents ACTEr, a unique Attractiveness-Consensus-based Title Evaluation metric, to objectively evaluate the quality of video title generation. This metric measures the semantic correlation between the candidate (model-generated title) and references (manually labeled titles) and introduces attractiveness consensus weights to assess the attractiveness and relevance of the video title. Accordingly, this work proposes a novel multi-modal ALignment WIth Generation model, ALWIG, as a strong baseline to aid future model development. With the help of a tag-driven video-text alignment module and a GPT-based generation module, this model achieves video titling, captioning, and retrieval simultaneously. We believe that the release of the CREATE dataset, ACTEr metric, and ALWIG model will encourage in-depth research on the analysis and creation of Chinese short videos.

2.
IEEE Trans Pattern Anal Mach Intell ; 45(10): 12377-12393, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37294644

ABSTRACT

Blind image inpainting involves two critical aspects, i.e., "where to inpaint" and "how to inpaint". Knowing "where to inpaint" can eliminate the interference arising from corrupted pixel values; a good "how to inpaint" strategy yields high-quality inpainted results robust to various corruptions. In existing methods, these two aspects usually lack explicit and separate consideration. This paper fully explores these two aspects and proposes a self-prior guided inpainting network (SIN). The self-priors are obtained by detecting semantic-discontinuous regions and by predicting global semantic structures of the input image. On the one hand, the self-priors are incorporated into the SIN, which enables the SIN to perceive valid context information from uncorrupted regions and to synthesize semantic-aware textures for corrupted regions. On the other hand, the self-priors are reformulated to provide a pixel-wise adversarial feedback and a high-level semantic structure feedback, which can promote the semantic continuity of inpainted images. Experimental results demonstrate that our method achieves state-of-the-art performance in metric scores and in visual quality. It has an advantage over many existing methods that assume "where to inpaint" is known in advance. Extensive experiments on a series of related image restoration tasks validate the effectiveness of our method in obtaining high-quality inpainting.

3.
Neural Netw ; 162: 393-411, 2023 May.
Article in English | MEDLINE | ID: mdl-36958245

ABSTRACT

The conventional Relation Extraction (RE) task involves identifying whether relations exist between two entities in a given sentence and determining their relation types. However, the complexity of practical application scenarios and the flexibility of natural language demand the ability to extract nested relations, i.e., the recognized relation triples may be components of higher-level relations. Previous studies have highlighted several challenges that affect the nested RE task, including the lack of abundant labeled data, inappropriate neural networks, and underutilization of the nested relation structures. To address these issues, we formalize the nested RE task and propose a hierarchical neural network that iteratively identifies the nested relations between entities and relation triples in a layer-by-layer manner. Moreover, a novel self-contrastive learning optimization strategy is presented to adapt our method to low-data settings by fully exploiting the constraints imposed by the nested structure and the semantic similarity between paired input sentences. Our method outperformed the state-of-the-art baseline methods in extensive experiments, and ablation experiments verified the effectiveness of the proposed self-contrastive learning optimization strategy.


Subject(s)
Language, Semantics, Neural Networks (Computer), Learning, Natural Language Processing
4.
Article in English | MEDLINE | ID: mdl-35901002

ABSTRACT

Recent one-stage object detectors follow a per-pixel prediction approach that predicts both the object category scores and boundary positions from every single grid location. However, the most suitable positions for inferring different targets, i.e., the object category and boundaries, are generally different. Predicting all these targets from the same grid location thus may lead to sub-optimal results. In this paper, we analyze the suitable inference positions for object category and boundaries, and propose a prediction-target-decoupled detector named PDNet to establish a more flexible detection paradigm. Our PDNet with the prediction decoupling mechanism encodes different targets separately in different locations. A learnable prediction collection module is devised with two sets of dynamic points, i.e., dynamic boundary points and semantic points, to collect and aggregate the predictions from the favorable regions for localization and classification. We adopt a two-step strategy to learn these dynamic point positions: the prior positions are first estimated for the different targets, and the network then predicts residual offsets to these positions with better perception of the object properties. Extensive experiments on the MS COCO benchmark demonstrate the effectiveness and efficiency of our method. With a single ResNeXt-64x4d-101-DCN as the backbone, our detector achieves 50.1 AP with single-scale testing, which outperforms the state-of-the-art methods by an appreciable margin under the same experimental settings. Moreover, our detector is highly efficient as a one-stage framework. Our code will be made public.

5.
IEEE Trans Pattern Anal Mach Intell ; 44(10): 7010-7028, 2022 10.
Article in English | MEDLINE | ID: mdl-34314355

ABSTRACT

For CNN-based visual action recognition, accuracy may be increased if local key action regions are focused on. The task of self-attention is to focus on key features and ignore irrelevant information, so self-attention is well suited to action recognition. However, current self-attention methods usually ignore correlations among local feature vectors at spatial positions in CNN feature maps. In this paper, we propose an effective interaction-aware self-attention model that extracts information about the interactions between feature vectors to learn attention maps. Since different layers in a network capture feature maps at different scales, we introduce a spatial pyramid with the feature maps at different layers for attention modeling. The multi-scale information is utilized to obtain more accurate attention scores. These attention scores are used to weight the local feature vectors of the feature maps and then calculate attentional feature maps. Since the number of feature maps input to the spatial pyramid attention layer is unrestricted, we easily extend this attention layer to a spatio-temporal version. Our model can be embedded in any general CNN to form a video-level end-to-end attention network for action recognition. Several methods are investigated to combine the RGB and flow streams to obtain accurate predictions of human actions. Experimental results show that our method achieves state-of-the-art results on the datasets UCF101, HMDB51, Kinetics-400, and untrimmed Charades.
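As a rough illustration of the core idea above (weighting local feature vectors of a CNN feature map by attention scores derived from their pairwise interactions), here is a minimal NumPy sketch. It is not the authors' model: the softmax over summed interactions is a deliberate simplification, and the multi-scale pyramid is omitted.

```python
import numpy as np

def spatial_attention(feature_map):
    """Toy interaction-aware attention over a CNN feature map.

    feature_map: array of shape (C, H, W). Returns an attention map of
    shape (H, W) that sums to 1, plus the attention-weighted feature map.
    Illustrative sketch only, not the paper's implementation.
    """
    C, H, W = feature_map.shape
    X = feature_map.reshape(C, H * W).T           # (N, C) local feature vectors
    inter = X @ X.T                               # (N, N) pairwise interactions
    scores = inter.sum(axis=1)                    # aggregate interaction strength
    scores = scores - scores.max()                # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()  # softmax over positions
    weighted = (X * attn[:, None]).T.reshape(C, H, W)
    return attn.reshape(H, W), weighted

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 4))
attn, out = spatial_attention(fmap)
```

A spatio-temporal extension would flatten (T, H, W) positions instead of (H, W), which is why the number of input feature maps being unrestricted matters.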


Subject(s)
Algorithms, Learning, Humans
6.
IEEE Trans Image Process ; 30: 6050-6065, 2021.
Article in English | MEDLINE | ID: mdl-34185642

ABSTRACT

For weakly supervised object localization (WSOL), preventing the network from focusing only on small discriminative parts is a main challenge that needs to be solved. The widely used Class Activation Mapping (CAM)-based paradigm usually employs an Adversarial Learning (AL) strategy to search for more object parts by constantly hiding discovered object features, but the adversarial process is difficult to control. In this paper, we propose a novel CAM-based framework with Multi-scale Low-Discriminative Feature Reactivation (mLDFR) for WSOL. The mLDFR framework reactivates the low-discriminative object parts via bottom-up continuous feature map recalibration and multi-scale object category mapping. Compared with the AL-based methods, our method fully improves the localization power of the network without damaging its classification power and can perform multi-instance localization, both of which are hard to achieve under the AL-based framework. Moreover, the mLDFR framework is flexible and can be built on top of various classical CNN backbones. Experimental results demonstrate the superiority of our method. With VGG16 as backbone, we achieve 46.96% Cls-Loc top1 err and 66.12% CorLoc on ILSVRC2014, and 38.07% Cls-Loc top1 err and 75.04% CorLoc on CUB200-2011, surpassing the state-of-the-art methods by a large margin.
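The CAM paradigm this entry builds on is compact enough to write down: a class activation map is the classifier-weighted sum of the last convolutional feature maps (after global average pooling training). The sketch below is the standard CAM computation, not the mLDFR reactivation itself.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Standard CAM: weight the last conv feature maps by the classifier
    weights of one class and sum over channels.

    feature_maps: (K, H, W) activations of the last conv layer.
    fc_weights:   (num_classes, K) weights of the final linear layer
                  applied after global average pooling.
    """
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)            # keep only positive class evidence
    if cam.max() > 0:
        cam = cam / cam.max()           # normalize to [0, 1] for visualization
    return cam

# tiny deterministic demo
fmaps = np.ones((3, 2, 2))
W = np.array([[1.0, 2.0, 3.0]])
cam = class_activation_map(fmaps, W, 0)
```

The WSOL weakness discussed above is visible in this formula: only channels with large classifier weights light up, so the map concentrates on the most discriminative parts.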

7.
IEEE Trans Neural Netw Learn Syst ; 32(10): 4499-4513, 2021 Oct.
Article in English | MEDLINE | ID: mdl-33136545

ABSTRACT

Model compression methods have become popular in recent years; they aim to alleviate the heavy load of deep neural networks (DNNs) in real-world applications. However, most of the existing compression methods have two limitations: 1) they usually adopt a cumbersome process, including pretraining, training with a sparsity constraint, pruning/decomposition, and fine-tuning, and the last three stages are usually iterated multiple times; 2) the models are pretrained under explicit sparsity or low-rank assumptions, which makes wide applicability difficult to guarantee. In this article, we propose an efficient decomposition and pruning (EDP) scheme via constructing a compressed-aware block that can automatically minimize the rank of the weight matrix and identify the redundant channels. Specifically, we embed the compressed-aware block by decomposing one network layer into two layers: a new weight matrix layer and a coefficient matrix layer. By imposing regularizers on the coefficient matrix, the new weight matrix learns to become a low-rank basis weight, and its corresponding channels become sparse. In this way, the proposed compressed-aware block simultaneously achieves low-rank decomposition and channel pruning in only one single data-driven training stage. Moreover, the network architecture is further compressed and optimized by a novel Pruning & Merging (PM) module which prunes redundant channels and merges redundant decomposed layers. Experimental results (17 competitors) on different data sets and networks demonstrate that the proposed EDP achieves a high compression ratio with acceptable accuracy degradation and outperforms state-of-the-art methods in compression rate, accuracy, inference time, and run-time memory.
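The core decompose-and-prune idea, one layer split into a basis weight matrix times a coefficient matrix whose rows can be zeroed out, can be illustrated with a truncated SVD plus a row-norm threshold. In the paper the row sparsity emerges from regularized training; the fixed threshold below is purely illustrative.

```python
import numpy as np

def decompose_and_prune(W, keep_energy=0.95, row_threshold=1e-3):
    """Toy stand-in for a decompose-and-prune step.

    Split a weight matrix W ~= B @ C via truncated SVD (low-rank
    decomposition), then drop rows of C whose l2 norm is negligible
    (channel pruning). Not the paper's training procedure.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(energy, keep_energy)) + 1  # rank keeping ~95% energy
    B = U[:, :r] * s[:r]          # basis weight matrix, (out, r)
    C = Vt[:r]                    # coefficient matrix, (r, in)
    keep = np.linalg.norm(C, axis=1) > row_threshold   # surviving "channels"
    return B[:, keep], C[keep]

# rank-1 demo: the decomposition finds the single underlying channel
W = np.outer(np.arange(1.0, 5.0), np.arange(1.0, 4.0))  # 4x3, rank 1
B, C = decompose_and_prune(W)
```

Replacing one dense layer with the two thin factors B and C is what yields the inference-time and memory savings the entry reports.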

8.
Article in English | MEDLINE | ID: mdl-32286989

ABSTRACT

Convolutional neural networks are built upon simple but useful convolution modules. The traditional convolution has a limitation on feature extraction and object localization due to its fixed scale and geometric structure. In addition, the loss of spatial information restricts the networks' performance and depth. To overcome these limitations, this paper proposes a novel anisotropic convolution by adding a scale factor and a shape factor to the traditional convolution. The anisotropic convolution augments the receptive fields flexibly and dynamically depending on the valid sizes of objects. Moreover, the anisotropic convolution is a generalized convolution: the traditional convolution, dilated convolution and deformable convolution can be viewed as its special cases. Furthermore, in order to improve training efficiency and avoid falling into a local optimum, this paper introduces a simplified implementation of the anisotropic convolution. The anisotropic convolution can be applied to arbitrary convolutional networks, and the enhanced networks are called ACNs (anisotropic convolutional networks). Experimental results show that ACNs achieve better performance than many state-of-the-art methods and the baseline networks in image classification and object localization tasks, especially in the classification of tiny images.

9.
Article in English | MEDLINE | ID: mdl-32275599

ABSTRACT

Convolutional Neural Networks have achieved great success in object recognition in still images. However, the improvement of Convolutional Neural Networks over traditional methods for recognizing actions in videos is not so significant, because raw videos usually contain much more redundant or irrelevant information than still images. In this paper, we propose a Spatial-Temporal Attentive Convolutional Neural Network (STA-CNN) which selects the discriminative temporal segments and focuses on the informative spatial regions automatically. The STA-CNN model incorporates a Temporal Attention Mechanism and a Spatial Attention Mechanism into a unified convolutional network to recognize actions in videos. The novel Temporal Attention Mechanism automatically mines the discriminative temporal segments from long and noisy videos. The Spatial Attention Mechanism first exploits the instantaneous motion information in optical flow features to locate the motion-salient regions, and it is then trained by an auxiliary classification loss with a Global Average Pooling layer to focus on the discriminative non-motion regions in the video frame. The STA-CNN model achieves state-of-the-art performance on two of the most challenging datasets, UCF-101 (95.8%) and HMDB-51 (71.5%).

10.
Article in English | MEDLINE | ID: mdl-29994476

ABSTRACT

Graphs are effective tools for modeling complex data. Setting out from two basic substructures, random walks and trees, we propose a new family of context-dependent random walk graph kernels and a new family of tree pattern graph matching kernels. In our context-dependent graph kernels, context information is incorporated into primary random walk groups. A multiple kernel learning algorithm with a proposed l1,2-norm regularization is applied to combine context-dependent graph kernels of different orders. This improves the similarity measurement between graphs. In our tree-pattern graph matching kernel, a quadratic optimization with a sparse constraint is proposed to select the correctly matched tree-pattern groups. This augments the discriminative power of the tree-pattern graph matching. We apply the proposed kernels to human action recognition, where each action is represented by two graphs which record the spatiotemporal relations between local feature vectors. Experimental comparisons with state-of-the-art algorithms on several benchmark datasets demonstrate the effectiveness of the proposed kernels for recognizing human actions. It is shown that our kernel based on tree-pattern groups, which have more complex structures and exploit more local topologies of graphs than random walks, yields more accurate results but requires more runtime than the context-dependent walk graph kernel.
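The plain random walk kernel underlying the context-dependent variants described above can be sketched via the direct product graph: walks common to both input graphs correspond one-to-one to walks in the product graph, and are discounted geometrically by length. The context-dependent weighting and the l1,2-regularized multiple kernel learning from the paper are omitted here.

```python
import numpy as np

def random_walk_kernel(A1, A2, p=3, lam=0.1):
    """Basic p-step random walk graph kernel.

    Counts walks of length 1..p shared by the two graphs via the
    adjacency matrix of their direct product graph, with geometric
    discount lam. A1, A2 are adjacency matrices.
    """
    Ax = np.kron(A1, A2)                # direct product graph adjacency
    n = Ax.shape[0]
    one = np.ones(n)
    total, power = 0.0, np.eye(n)
    for i in range(1, p + 1):
        power = power @ Ax              # walk counts of length i
        total += (lam ** i) * (one @ power @ one)
    return total

# demo: two identical single-edge graphs
A = np.array([[0.0, 1.0], [1.0, 0.0]])
k = random_walk_kernel(A, A, p=3, lam=0.1)
```

A kernel of this form can be fed directly to a kernel SVM, which is the typical classification setup for graph-represented actions.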

11.
IEEE Trans Pattern Anal Mach Intell ; 40(10): 2355-2373, 2018 10.
Article in English | MEDLINE | ID: mdl-28952936

ABSTRACT

In this paper, a new nonparametric Bayesian model called the dual sticky hierarchical Dirichlet process hidden Markov model (HDP-HMM) is proposed for mining activities from a collection of time series data such as trajectories. All the time series data are clustered. Each cluster of time series data, corresponding to a motion pattern, is modeled by an HMM. Our model postulates a set of HMMs that share a common set of states (topics in an analogy with topic models for document processing), but have unique transition distributions. The number of HMMs and the number of topics are both automatically determined. The sticky prior avoids redundant states and makes our HDP-HMM more effective to model multimodal observations. For the application to motion trajectory modeling, topics correspond to motion activities. The learnt topics are clustered into atomic activities which are assigned predicates. We propose a Bayesian inference method to decompose a given trajectory into a sequence of atomic activities. The sources and sinks in the scene are learnt by clustering endpoints (origins and destinations) of trajectories. The semantic motion regions are learnt using the points in trajectories. On combining the learnt sources and sinks, the learnt semantic motion regions, and the learnt sequence of atomic activities, the action represented by a trajectory can be described in natural language in as automatic a way as possible. The effectiveness of our dual sticky HDP-HMM is validated on several trajectory datasets. The effectiveness of the natural language descriptions for motions is demonstrated on the vehicle trajectories extracted from a traffic scene.

12.
IEEE Trans Pattern Anal Mach Intell ; 39(12): 2554-2560, 2017 12.
Article in English | MEDLINE | ID: mdl-28212079

ABSTRACT

In multi-instance learning (MIL), the relations among instances in a bag convey important contextual information in many applications. Previous studies on MIL either ignore such relations or simply model them with a fixed graph structure, so the overall performance inevitably degrades in complex environments. To address this problem, this paper proposes a novel multi-view multi-instance learning algorithm (M²IL) that combines multiple context structures in a bag into a unified framework. The novel aspects are: (i) we propose a sparse ε-graph model that can generate different graphs with different parameters to represent various context relations in a bag, (ii) we propose a multi-view joint sparse representation that integrates these graphs into a unified framework for bag classification, and (iii) we propose a multi-view dictionary learning algorithm to obtain a multi-view graph dictionary that considers cues from all views simultaneously to improve the discrimination of the M²IL. Experiments and analyses in many practical applications prove the effectiveness of the M²IL.

13.
IEEE Trans Image Process ; 23(2): 570-81, 2014 Feb.
Article in English | MEDLINE | ID: mdl-26270909

ABSTRACT

In this paper, we propose using high-level action units to represent human actions in videos and, based on such units, a novel sparse model is developed for human action recognition. There are three interconnected components in our approach. First, we propose a new context-aware spatial-temporal descriptor, named locally weighted word context, to improve the discriminability of the traditionally used local spatial-temporal descriptors. Second, from the statistics of the context-aware descriptors, we learn action units using the graph regularized nonnegative matrix factorization, which leads to a part-based representation and encodes the geometrical information. These units effectively bridge the semantic gap in action recognition. Third, we propose a sparse model based on a joint l2,1-norm to preserve the representative items and suppress noise in the action units. Intuitively, when learning the dictionary for action representation, the sparse model captures the fact that actions from the same class share similar units. The proposed approach is evaluated on several publicly available data sets. The experimental results and analysis clearly demonstrate the effectiveness of the proposed approach.
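The joint l2,1-norm mentioned above, and the row-wise shrinkage step typically used when optimizing it, are short to write down. The proximal update below is the generic operator for an l2,1 penalty, not the paper's full dictionary-learning procedure.

```python
import numpy as np

def l21_norm(M):
    """Joint l2,1-norm: sum of the l2 norms of the rows. Penalizing it
    drives entire rows to zero, which is why it keeps a few shared
    action units per class and suppresses noisy ones."""
    return np.linalg.norm(M, axis=1).sum()

def l21_prox(M, t):
    """Row-wise soft-threshold: the proximal operator of t * l2,1.
    Rows with norm below t are zeroed; the rest shrink toward zero."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return M * scale

M = np.array([[3.0, 4.0],    # strong row: survives, shrunk
              [0.1, 0.0]])   # weak row: eliminated
P = l21_prox(M, 0.5)
```

The row-elimination behavior is exactly the "preserve representative items, suppress noise" effect the abstract describes.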


Subject(s)
Image Interpretation, Computer-Assisted/methods, Motor Activity/physiology, Movement/physiology, Pattern Recognition, Automated/methods, Subtraction Technique, Whole Body Imaging/methods, Algorithms, Humans, Image Enhancement/methods, Imaging, Three-Dimensional/methods, Photography/methods, Reproducibility of Results, Sensitivity and Specificity
14.
IEEE Trans Image Process ; 23(2): 658-72, 2014 Feb.
Article in English | MEDLINE | ID: mdl-26270910

ABSTRACT

In this paper, we present a new geometric-temporal representation for visual action recognition based on local spatio-temporal features. First, we propose a modified covariance descriptor under the log-Euclidean Riemannian metric to represent the spatio-temporal cuboids detected in the video sequences. Compared with previously proposed covariance descriptors, our descriptor can be measured and clustered in Euclidean space. Second, to capture the geometric-temporal contextual information, we construct a directional pyramid co-occurrence matrix (DPCM) to describe the spatio-temporal distribution of the vector-quantized local feature descriptors extracted from a video. DPCM characterizes the co-occurrence statistics of local features as well as the spatio-temporal positional relationships among the concurrent features. These statistics provide strong descriptive power for action recognition. To use DPCM for action recognition, we propose a directional pyramid co-occurrence matching kernel to measure the similarity of videos. The proposed method achieves state-of-the-art performance and improves on the recognition performance of bag-of-visual-words (BOVW) models by a large margin on six public data sets. For example, on the KTH data set it achieves 98.78% accuracy while the BOVW approach only achieves 88.06%. On both the Weizmann and UCF CIL data sets, the highest possible accuracy of 100% is achieved.
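The key property claimed for the covariance descriptor, that after a log-Euclidean map it can be compared and clustered with plain Euclidean operations, can be sketched as follows. The matrix logarithm is computed via the eigendecomposition, which is valid for symmetric positive-definite matrices; this is an illustration, not the paper's modified descriptor.

```python
import numpy as np

def log_euclidean_descriptor(features, eps=1e-6):
    """Covariance descriptor under the log-Euclidean metric.

    features: (n_samples, d) local feature vectors from one cuboid.
    Returns the flattened matrix logarithm of their covariance, after
    which ordinary Euclidean distance / k-means is meaningful.
    """
    C = np.cov(features, rowvar=False)
    C += eps * np.eye(C.shape[0])        # ensure strictly positive-definite
    w, V = np.linalg.eigh(C)             # C = V diag(w) V^T
    logC = (V * np.log(w)) @ V.T         # matrix logarithm of an SPD matrix
    return logC.ravel()

rng = np.random.default_rng(0)
d1 = log_euclidean_descriptor(rng.standard_normal((50, 3)))
d2 = log_euclidean_descriptor(rng.standard_normal((50, 3)))
dist = np.linalg.norm(d1 - d2)           # Euclidean distance between cuboids
```

This Euclidean embedding is what lets the descriptors be vector-quantized for the DPCM stage with standard clustering.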

15.
IEEE Trans Pattern Anal Mach Intell ; 36(12): 2466-82, 2014 Dec.
Article in English | MEDLINE | ID: mdl-26353152

ABSTRACT

In this paper, we address the problem of human action recognition by combining global temporal dynamics and local visual spatio-temporal appearance features. For this purpose, in the global temporal dimension, we propose to model the motion dynamics with robust linear dynamical systems (LDSs) and use the model parameters as motion descriptors. Since LDSs live in a non-Euclidean space and the descriptors are in non-vector form, we propose a shift-invariant subspace-angles-based distance to measure the similarity between LDSs. In the local visual dimension, we construct curved spatio-temporal cuboids along the trajectories of densely sampled feature points and describe them using histograms of oriented gradients (HOG). The distance between motion sequences is computed with the Chi-Squared histogram distance in the bag-of-words framework. Finally, we perform classification using the maximum margin distance learning method, combining the global dynamic distances and the local visual distances. We evaluate our approach for action recognition on five short-clip data sets, namely Weizmann, KTH, UCF sports, Hollywood2 and UCF50, as well as three long continuous data sets, namely VIRAT, ADL and CRIM13. We show competitive results compared with current state-of-the-art methods.
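The Chi-Squared histogram distance used above for comparing bag-of-words representations is a one-liner, included here for reference.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two histograms:
    0.5 * sum (h1 - h2)^2 / (h1 + h2). The eps guard avoids division
    by zero in empty bins."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```

Unlike plain Euclidean distance, it normalizes each bin's contribution by the bin mass, which suits word-count histograms with widely varying bin frequencies.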


Subject(s)
Algorithms, Image Processing, Computer-Assisted/methods, Motor Activity/physiology, Pattern Recognition, Automated/methods, Humans, Machine Learning, Spatio-Temporal Analysis, Video Recording