Results 1 - 20 of 49
1.
Article in English | MEDLINE | ID: mdl-38598385

ABSTRACT

Mapping motion between characters whose skeletons have different structures but correspond to homeomorphic graphs, while preserving motion semantics and perceiving shape geometries, poses significant challenges in skinned motion retargeting. We propose M-R2ET, a modular neural motion retargeting system that comprehensively addresses these challenges. The key insight driving M-R2ET is its capacity to learn residual motion modifications within a canonical skeleton space. Specifically, a cross-structure alignment module learns joint correspondences among diverse skeletons, enabling motion copy and forming a reliable initial motion for semantics and geometry perception. In addition, two residual modification modules, i.e., a skeleton-aware module and a shape-aware module, preserve source motion semantics and perceive target character geometries, effectively reducing interpenetration and missing contacts. Driven by our distance-based losses that explicitly model semantics and geometry, these two modules learn residual modifications to the initial motion in a single inference pass without post-processing. To balance the two motion modifications, we further present a balancing gate that performs linear interpolation between them. Extensive experiments on the public Mixamo dataset demonstrate that M-R2ET achieves state-of-the-art performance, enabling cross-structure motion retargeting while providing a good balance between preserving motion semantics and attenuating interpenetration and missing contacts.
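At inference time, the balancing gate reduces to a convex combination of the two residual modifications added on top of the copied initial motion. A minimal sketch in Python/NumPy, assuming a single scalar gate weight `alpha` and axis-angle joint offsets (both are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def blend_residuals(initial_motion, semantic_residual, geometry_residual, alpha):
    """Linearly interpolate two residual modifications and add them to the
    initial (copied) motion. alpha in [0, 1] is the balancing-gate weight:
    alpha = 1 keeps only the semantics-preserving residual, alpha = 0 keeps
    only the geometry-aware residual. Shapes: (num_frames, num_joints, 3)."""
    residual = alpha * semantic_residual + (1.0 - alpha) * geometry_residual
    return initial_motion + residual

# Toy usage with random axis-angle offsets.
T, J = 16, 22
init = np.zeros((T, J, 3))
sem = 0.05 * np.random.randn(T, J, 3)
geo = 0.05 * np.random.randn(T, J, 3)
retargeted = blend_residuals(init, sem, geo, alpha=0.6)
print(retargeted.shape)  # (16, 22, 3)
```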

2.
Cell; 2024 Apr 10.
Article in English | MEDLINE | ID: mdl-38631355

ABSTRACT

Precise control of gene expression levels is essential for normal cell functions, yet how these levels are defined and tightly maintained, particularly at intermediate levels, remains elusive. Here, using a series of newly developed sequencing, imaging, and functional assays, we uncover a class of transcription factors with dual roles as activators and repressors, referred to as condensate-forming, level-regulating dual-action transcription factors (TFs). They reduce high expression but increase low expression to achieve stable intermediate levels. Dual-action TFs directly exert activating and repressing functions via condensate-forming domains that selectively compartmentalize the core transcriptional unit. Clinically relevant mutations in these domains, which are linked to a range of developmental disorders, impair condensate selectivity and dual-action TF activity. These results collectively address a fundamental question in expression regulation and demonstrate the potential of level-regulating dual-action TFs as powerful effectors for engineering controlled expression levels.

3.
Article in English | MEDLINE | ID: mdl-37729565

ABSTRACT

This work makes the first effort to address unsupervised 3-D action representation learning from point cloud sequences, in contrast to existing unsupervised methods that rely on 3-D skeleton information. Our approach builds on the state-of-the-art 3-D action descriptor, the 3-D dynamic voxel (3DV), combined with contrastive learning (CL). 3DV compresses a point cloud sequence into a compact point cloud of 3-D motion information, on which spatiotemporal data augmentations are applied to drive CL. However, we find that existing CL methods (e.g., SimCLR or MoCo v2) often suffer from high pattern variance across the augmented 3DV samples of the same action instance; that is, the augmented 3DV samples remain highly complementary in feature space after CL, yet the complementary discriminative clues within them are not well exploited. To address this, we propose a feature augmentation adapted CL (FACL) approach, which facilitates 3-D action representation learning by jointly considering the features from all augmented 3DV samples, in the spirit of feature augmentation. FACL runs in a global-local way: one branch learns a global feature that involves the discriminative clues from the raw and augmented 3DV samples, and the other focuses on enhancing the discriminative power of the local feature learned from each augmented 3DV sample. The global and local features are fused via concatenation to jointly characterize the 3-D action. To fit FACL, a series of spatiotemporal data augmentation approaches is also studied on 3DV. Extensive experiments verify the superiority of our unsupervised method for 3-D action feature learning: it outperforms the state-of-the-art skeleton-based counterparts by 6.4% and 3.6% under the cross-setup and cross-subject test settings on NTU RGB+D 120, respectively. The source code is available at https://github.com/tangent-T/FACL.

4.
IEEE Trans Image Process; 32: 3507-3520, 2023.
Article in English | MEDLINE | ID: mdl-37335800

ABSTRACT

Recognizing human actions in dark videos is a useful yet challenging visual task in practice. Existing augmentation-based methods separate action recognition and dark enhancement in a two-stage pipeline, which leads to inconsistent learning of temporal representations for action recognition. To address this issue, we propose a novel end-to-end framework, the Dark Temporal Consistency Model (DTCM), which jointly optimizes dark enhancement and action recognition and enforces temporal consistency to guide downstream dark feature learning. Specifically, DTCM cascades the action classification head with the dark augmentation network to perform dark video action recognition in a one-stage pipeline. Our spatio-temporal consistency loss, which uses the RGB-Difference of dark video frames to encourage temporal coherence of the enhanced video frames, is effective for boosting spatio-temporal representation learning. Extensive experiments demonstrate that DTCM has remarkable performance: 1) competitive accuracy, outperforming the state of the art on the ARID dataset by 2.32% and on the UAVHuman-Fisheye dataset by 4.19%; 2) high efficiency, surpassing the current most advanced method (Chen et al., 2021) with only 6.4% of the GFLOPs and 71.3% of the parameters; 3) strong generalization, as it can be plugged into various action recognition methods (e.g., TSM, I3D, 3D-ResNext-101, Video-Swin) to improve their performance significantly.
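The consistency term can be pictured as matching frame-to-frame RGB differences between the enhanced video and the original dark video. A minimal NumPy sketch of such a loss, written as a hypothetical illustration (the paper's exact formulation may weight or normalize the term differently):

```python
import numpy as np

def rgb_difference_consistency_loss(enhanced, dark):
    """Encourage the temporal RGB-Difference of the enhanced video to follow
    the RGB-Difference of the original dark video.
    enhanced, dark: (T, H, W, 3) float arrays in [0, 1]."""
    diff_enhanced = enhanced[1:] - enhanced[:-1]   # frame-to-frame differences
    diff_dark = dark[1:] - dark[:-1]
    return np.mean((diff_enhanced - diff_dark) ** 2)

# Toy usage: under-exposed frames and a crudely brightened version of them.
dark = np.random.rand(8, 32, 32, 3) * 0.2
enhanced = np.clip(dark * 4.0 + 0.05 * np.random.randn(*dark.shape), 0, 1)
print(rgb_difference_consistency_loss(enhanced, dark))
```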


Subject(s)
Algorithms; Pattern Recognition, Automated; Humans; Video Recording; Pattern Recognition, Automated/methods; Learning; Human Activities
5.
IEEE Trans Pattern Anal Mach Intell; 45(8): 9469-9485, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37027607

ABSTRACT

We present a method for reconstructing accurate and consistent 3D hands from a monocular video. We observe that the detected 2D hand keypoints and the image texture provide important cues about the geometry and texture of the 3D hand, which can reduce or even eliminate the need for 3D hand annotations. Accordingly, we propose S2HAND, a self-supervised 3D hand reconstruction model that jointly estimates pose, shape, texture, and camera viewpoint from a single RGB input, supervised only by easily accessible detected 2D keypoints. We further leverage the continuous hand motion information contained in unlabeled video data and explore S2HAND(V), which uses a weight-shared S2HAND to process each frame and exploits additional motion, texture, and shape consistency constraints to obtain more accurate hand poses and more consistent shapes and textures. Experiments on benchmark datasets demonstrate that our self-supervised method achieves hand reconstruction performance comparable to recent fully-supervised methods in the single-frame input setup, and notably improves reconstruction accuracy and consistency when using video training data.


Subject(s)
Algorithms; Benchmarking; Cues; Motion; Supervised Machine Learning
6.
Article in English | MEDLINE | ID: mdl-37027699

ABSTRACT

Freely exploring a real-world 4D spatiotemporal space in VR has been a long-term quest. The task is especially appealing when only a few or even a single RGB camera is used for capturing the dynamic scene. To this end, we present an efficient framework capable of fast reconstruction, compact modeling, and streamable rendering. First, we propose to decompose the 4D spatiotemporal space according to temporal characteristics: points in the 4D space are associated with probabilities of belonging to three categories, namely static, deforming, and new areas, and each area is represented and regularized by a separate neural field. Second, we propose a hybrid-representation-based feature streaming scheme for efficiently modeling the neural fields. Our approach, coined NeRFPlayer, is evaluated on dynamic scenes captured by single hand-held cameras and multi-camera arrays, achieving rendering quality and speed comparable or superior to recent state-of-the-art methods, with reconstruction in 10 seconds per frame and interactive rendering. Project website: https://bit.ly/nerfplayer.

7.
IEEE Trans Med Imaging; 42(7): 2057-2067, 2023 Jul.
Article in English | MEDLINE | ID: mdl-36215346

ABSTRACT

Federated Learning (FL) is a machine learning paradigm in which many local nodes collaboratively train a central model while keeping the training data decentralized. This is particularly relevant for clinical applications, since patient data are usually not allowed to be transferred out of medical facilities, leading to the need for FL. Existing FL methods typically share model parameters or employ co-distillation to address the issue of unbalanced data distribution. However, they also require numerous rounds of synchronized communication and, more importantly, suffer from a privacy leakage risk. In this work, we propose a privacy-preserving FL framework that leverages unlabeled public data for one-way offline knowledge distillation. The central model is learned from local knowledge via ensemble attention distillation. Our technique uses decentralized and heterogeneous local data like existing FL approaches, but more importantly, it significantly reduces the risk of privacy leakage. Extensive experiments on image classification, segmentation, and reconstruction tasks demonstrate that our method achieves very competitive performance with more robust privacy preservation.
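The one-way offline distillation step can be pictured as training the central model toward soft targets aggregated from the local models' predictions on an unlabeled public batch. A minimal NumPy sketch under that assumption (the plain weighted mean below stands in for the paper's attention-based aggregation):

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_distillation_targets(local_logits, weights=None):
    """Soft targets for the central model: each local model predicts on the
    same unlabeled public batch, and the predictions are averaged.
    local_logits: (num_clients, batch, num_classes)."""
    probs = softmax(local_logits, axis=-1)
    if weights is None:
        weights = np.full(len(local_logits), 1.0 / len(local_logits))
    return np.tensordot(weights, probs, axes=1)    # (batch, num_classes)

def distillation_loss(central_logits, targets):
    """Cross-entropy of the central model against the ensemble soft targets."""
    log_p = np.log(softmax(central_logits) + 1e-12)
    return -np.mean(np.sum(targets * log_p, axis=-1))

# Toy usage: 3 local models, a public batch of 4 samples, 5 classes.
locals_ = np.random.randn(3, 4, 5)
central = np.random.randn(4, 5)
print(distillation_loss(central, ensemble_distillation_targets(locals_)))
```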


Subject(s)
Machine Learning; Privacy; Humans
8.
IEEE Trans Pattern Anal Mach Intell; 45(4): 4136-4151, 2023 Apr.
Article in English | MEDLINE | ID: mdl-35816538

ABSTRACT

Weakly-supervised temporal action localization (W-TAL) aims to classify and localize all action instances in untrimmed videos under only video-level supervision. Without frame-level annotations, it is challenging for W-TAL methods to clearly distinguish actions from background, which severely degrades action boundary localization and action proposal scoring. In this paper, we present an adaptive two-stream consensus network (A-TSCN) to address this problem. A-TSCN features an iterative refinement training scheme: a frame-level pseudo ground truth is generated and iteratively updated from a late-fusion activation sequence, and is used to provide frame-level supervision for improved model training. In addition, we introduce an adaptive attention normalization loss, which adaptively selects action and background snippets according to the video's attention distribution. By separating the attention values of the selected action and background snippets, it forces the predicted attention to act as a binary selection and promotes precise localization of action boundaries. Furthermore, we propose a video-level and a snippet-level uncertainty estimator, which mitigate the adverse effect of learning from noisy pseudo ground truth. Experiments on the THUMOS14, ActivityNet v1.2, ActivityNet v1.3, and HACS datasets show that A-TSCN outperforms current state-of-the-art methods and even achieves performance comparable to several fully-supervised methods.
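One way to read the attention normalization idea is as a contrast between the most- and least-attended snippets. A minimal NumPy sketch of a loss of that shape, offered as an illustrative assumption rather than the paper's exact adaptive selection rule:

```python
import numpy as np

def attention_normalization_loss(attention, k_ratio=0.1):
    """Take the top-k attention values as presumed action snippets and the
    bottom-k as presumed background, then push their means toward 1 and 0
    respectively so the attention behaves like a binary selection.
    attention: (num_snippets,) array of values in [0, 1]."""
    n = len(attention)
    k = max(1, int(n * k_ratio))
    sorted_att = np.sort(attention)
    top_mean = sorted_att[-k:].mean()       # presumed action snippets
    bottom_mean = sorted_att[:k].mean()     # presumed background snippets
    return (1.0 - top_mean) + bottom_mean   # minimized when the means are 1 and 0

# Toy usage on a random per-snippet attention sequence.
att = np.random.rand(100)
print(attention_normalization_loss(att, k_ratio=0.2))
```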

9.
IEEE Trans Image Process; 31: 4104-4116, 2022.
Article in English | MEDLINE | ID: mdl-35687626

ABSTRACT

Action visual tempo characterizes the dynamics and temporal scale of an action, which helps distinguish human actions that share high similarity in visual dynamics and appearance. Previous methods capture visual tempo either by sampling raw videos at multiple rates, which requires a costly multi-layer network to handle each rate, or by hierarchically sampling backbone features, which relies heavily on high-level features that miss fine-grained temporal dynamics. In this work, we propose a Temporal Correlation Module (TCM) that can be easily embedded into current action recognition backbones in a plug-and-play manner to extract action visual tempo from low-level, single-layer backbone features. Specifically, TCM contains two main components: a Multi-scale Temporal Dynamics Module (MTDM) and a Temporal Attention Module (TAM). MTDM applies a correlation operation to learn pixel-wise, fine-grained temporal dynamics for both fast and slow tempos. TAM adaptively emphasizes expressive features and suppresses inessential ones by analyzing global information across the various tempos. Extensive experiments on several action recognition benchmarks, e.g., Something-Something V1&V2, Kinetics-400, UCF-101, and HMDB-51, demonstrate that the proposed TCM improves the performance of existing video-based action recognition models by a large margin. The source code is publicly released at https://github.com/zphyix/TCM.
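The correlation operation in MTDM can be pictured as local dot-product matching between the feature maps of neighboring frames. A minimal NumPy sketch of such a pixel-wise temporal correlation over a bounded displacement window (an illustrative implementation, not the released code):

```python
import numpy as np

def temporal_correlation(feat_t, feat_t1, max_disp=3):
    """For every spatial position, correlate the feature vector at time t with
    feature vectors at time t+1 inside a (2*max_disp+1)^2 neighborhood.
    feat_t, feat_t1: (C, H, W). Returns (D, H, W) with D = (2*max_disp+1)**2."""
    C, H, W = feat_t.shape
    d = 2 * max_disp + 1
    padded = np.pad(feat_t1, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    out = np.zeros((d * d, H, W), dtype=feat_t.dtype)
    idx = 0
    for dy in range(d):
        for dx in range(d):
            shifted = padded[:, dy:dy + H, dx:dx + W]
            out[idx] = (feat_t * shifted).sum(axis=0) / C   # normalized dot product
            idx += 1
    return out

# Toy usage on random backbone features of two consecutive frames.
f0, f1 = np.random.randn(16, 14, 14), np.random.randn(16, 14, 14)
print(temporal_correlation(f0, f1).shape)   # (49, 14, 14)
```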


Subject(s)
Human Activities; Software; Humans
10.
J Anim Physiol Anim Nutr (Berl); 106(3): 552-560, 2022 May.
Article in English | MEDLINE | ID: mdl-34111322

ABSTRACT

Evidence has shown that oestrogen suppresses lipid deposition in the liver of mammals. However, the molecular mechanism of oestrogen action in hepatic steatosis of the goose liver has yet to be determined. This study aimed to investigate the effect of oestrogen on lipid homeostasis in goose hepatocytes at different states in vitro. An in vitro model of hepatic steatosis was induced with 1.5 mM sodium oleate, as verified by assessing hepatocyte viability and lipid content. When normal hepatocytes were treated with different concentrations of oestrogen (E2), the expression levels of diacylglycerol acyltransferase 2 (DGAT2), microsomal triglyceride transfer protein (MTTP) and oestrogen receptors (ERs, alpha and beta) were up-regulated only at high concentrations of E2, whereas lipid content did not differ significantly. In steatotic goose hepatocytes, however, the expression levels of MTTP, apolipoprotein B (apoB) and ERα/ß increased significantly at 10⁻⁷ or 10⁻⁶ M E2. Meanwhile, lipid content increased significantly at 10⁻⁹ and 10⁻⁸ M E2 and decreased at 80 µM E2. Further heatmap analysis showed that ERα clustered with apoB and MTTP in both normal and steatotic hepatocytes. Taken together, E2 might bind to ERα to up-regulate the expression of apoB and MTTP, promoting lipid transport and alleviating lipid overload in steatotic goose hepatocytes in vitro.


Subject(s)
Fatty Liver; Geese; Animals; Apolipoproteins B/metabolism; Estrogen Receptor alpha/genetics; Estrogen Receptor alpha/metabolism; Estrogens/pharmacology; Fatty Liver/chemically induced; Fatty Liver/veterinary; Hepatocytes; Lipid Metabolism; Liver/metabolism
11.
IEEE Trans Cybern; 52(7): 7136-7150, 2022 Jul.
Article in English | MEDLINE | ID: mdl-33382666

ABSTRACT

The core prerequisite of most modern trackers is a motion assumption: the current location is predicted within a limited search region centered at the previous prediction. For clarity, the central subregion of a search region is denoted as the tracking anchor (e.g., the location of the previous prediction in the current frame). However, providing accurate predictions in all frames is very challenging in complex natural scenes. In addition, target locations in consecutive frames often change drastically under fast motion. Both facts are likely to turn the previous prediction into an unreliable tracking anchor, which invalidates the aforementioned prerequisite and causes tracking drift. To enhance the reliability of tracking anchors, we propose a real-time multianchor visual tracking mechanism, called multianchor tracking (MAT). Instead of directly relying on the tracking anchor inherited from the previous prediction, MAT selects the best anchor from an anchor ensemble, which includes several objectness-based anchor proposals and the anchor inherited from the previous prediction. The objectness-based anchors provide several complementary selective search regions, and an entropy-minimization-based selection method is introduced to find the best anchor. Our approach offers two benefits: 1) selective search regions increase the chance of tracking success at an affordable computational load, and 2) anchor selection picks the best anchor for each frame, which removes the limitation of depending solely on the previous prediction. Extensive experiments with nine base trackers upgraded by MAT on four challenging datasets demonstrate the effectiveness of MAT.
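Entropy-minimization-based anchor selection can be read as preferring the candidate whose tracker response is most peaked. A minimal NumPy sketch of that selection rule, given as a hypothetical illustration (the paper's exact criterion may combine further cues):

```python
import numpy as np

def select_anchor(response_maps):
    """Each candidate anchor yields a tracker response map over its search
    region; pick the anchor whose normalized response has the lowest entropy,
    i.e., the most peaked and most confident response.
    response_maps: list of (H, W) non-negative arrays."""
    entropies = []
    for r in response_maps:
        p = r / (r.sum() + 1e-12)
        entropies.append(-np.sum(p * np.log(p + 1e-12)))
    return int(np.argmin(entropies))

# Toy usage: a peaked response (confident) vs. a flat one (ambiguous).
peaked = np.zeros((17, 17)); peaked[8, 8] = 1.0
flat = np.ones((17, 17))
print(select_anchor([flat, peaked]))   # -> 1 (the peaked map)
```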


Subject(s)
Image Interpretation, Computer-Assisted; Motion; Reproducibility of Results
12.
IEEE Trans Image Process; 30: 4008-4021, 2021.
Article in English | MEDLINE | ID: mdl-33784621

ABSTRACT

Accurate 3D reconstruction of hand and object shape from a hand-object image is important for understanding human-object interaction as well as human daily activities. Unlike bare hand pose estimation, hand-object interaction imposes a strong constraint on both the hand and its manipulated object, which suggests that the hand configuration may provide crucial contextual information for the object, and vice versa. However, current approaches address this task by training a two-branch network to reconstruct the hand and the object separately, with little communication between the two branches. In this work, we propose to consider hand and object jointly in feature space and explore the reciprocity of the two branches. We extensively investigate cross-branch feature fusion architectures with MLP or LSTM units. Among the investigated architectures, a variant with LSTM units that enhances the object feature with the hand feature shows the best performance gain. Moreover, we employ an auxiliary depth estimation module to augment the input RGB image with an estimated depth map, which further improves the reconstruction accuracy. Experiments on public datasets demonstrate that our approach significantly outperforms existing approaches in terms of object reconstruction accuracy.

13.
IEEE Trans Image Process; 30: 2784-2797, 2021.
Article in English | MEDLINE | ID: mdl-33523810

ABSTRACT

Recent advances in the joint processing of a set of images have shown its advantages over individual processing. Unlike existing works geared towards co-segmentation or co-localization, in this article we explore a new joint processing topic: image co-skeletonization, defined as the joint skeleton extraction of the foreground objects in an image collection. Object skeletonization in a single natural image is known to be challenging, because there is hardly any prior knowledge available about the object present in the image. We therefore resort to the idea of image co-skeletonization, hoping that the commonness prior that exists across semantically similar images can be leveraged to obtain such knowledge, as in other joint processing problems such as co-segmentation. Moreover, earlier research has found that augmenting a skeletonization process with the object's shape information is highly beneficial in capturing the image context. Based on these two observations, we propose a coupled framework for the co-skeletonization and co-segmentation tasks, in which the co-segmentation process supplies shape information to our co-skeletonization process. While image co-skeletonization is our primary goal, the co-segmentation process may in turn benefit from exploiting the skeleton outputs of the co-skeletonization process as central object seeds, so the two tasks benefit from each other synergistically. For evaluating image co-skeletonization results, we also construct a novel benchmark dataset by annotating nearly 1.8K images divided into 38 semantic categories. Although the proposed idea is essentially a weakly supervised method, it can also be employed in supervised and unsupervised scenarios. Extensive experiments demonstrate that the proposed method achieves promising results in all three scenarios.

14.
IEEE Trans Image Process; 30: 2168-2179, 2021.
Article in English | MEDLINE | ID: mdl-33471754

ABSTRACT

In this paper, we tackle 3D object representation learning from the perspective of set-to-set matching. Given two 3D objects, calculating their similarity is formulated as measuring the set-to-set similarity between two sets of local patches. As local convolutional features from convolutional feature maps are natural representations of local patches, the set-to-set matching between sets of local patches is further converted into a local feature pooling problem. To highlight good matches and suppress bad ones, we exploit two pooling methods: 1) bilinear pooling and 2) VLAD pooling. We analyze their effectiveness in enhancing the set-to-set matching and establish their connection. Moreover, to balance the different components inherent in a bilinear-pooled feature, we propose the harmonized bilinear pooling operation, which follows the spirit of the intra-normalization used in VLAD pooling. To achieve an end-to-end trainable framework, we implement the proposed harmonized bilinear pooling and intra-normalized VLAD as two layers to construct two types of neural networks: the multi-view harmonized bilinear network (MHBN) and the multi-view VLAD network (MVLADN). Systematic experiments on two public benchmark datasets demonstrate the efficacy of the proposed MHBN and MVLADN in 3D object recognition.
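Bilinear pooling aggregates second-order statistics of the local features, and the "harmonizing" step rebalances the resulting components before normalization. A minimal NumPy sketch, written as an illustrative stand-in (the element-wise signed square root used here follows common practice and is an assumption, not necessarily the paper's exact harmonization):

```python
import numpy as np

def bilinear_pool(features):
    """Sum-pooled bilinear (outer-product) feature over a set of local
    descriptors. features: (N, D) -> (D*D,) vector."""
    pooled = features.T @ features            # (D, D) second-order statistics
    return pooled.reshape(-1)

def harmonized_bilinear_pool(features, eps=1e-12):
    """Balance the components of the bilinear feature with a signed square
    root, in the spirit of intra-normalization, then L2-normalize."""
    b = bilinear_pool(features)
    b = np.sign(b) * np.sqrt(np.abs(b))       # rebalance component magnitudes
    return b / (np.linalg.norm(b) + eps)

# Toy usage: 49 local descriptors of dimension 64 from one view.
local_feats = np.random.randn(49, 64)
print(harmonized_bilinear_pool(local_feats).shape)   # (4096,)
```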

15.
IEEE Trans Pattern Anal Mach Intell; 43(9): 3259-3272, 2021 Sep.
Article in English | MEDLINE | ID: mdl-32149622

ABSTRACT

Visual captioning, the task of describing an image or a video with one or a few sentences, is challenging owing to the complexity of understanding copious visual information and describing it in natural language. Motivated by the success of neural networks in machine translation, previous work applies sequence-to-sequence learning to translate videos into sentences. In this work, unlike previous work that encodes visual information with a single flow, we introduce a novel Sibling Convolutional Encoder (SibNet) for visual captioning, which employs a dual-branch architecture to collaboratively encode videos. The first, content branch encodes the visual content information of the video with an autoencoder, capturing the visual appearance information as other networks often do, while the second, semantic branch encodes semantic information of the video via visual-semantic joint embedding, bringing a complementary representation by considering semantics when extracting features from videos. Both branches are then combined with a soft-attention mechanism and fed into an RNN decoder to generate captions. With SibNet explicitly capturing both content and semantic information, the proposed model can better represent the rich information in videos. To validate the advantages of the proposed model, we conduct experiments on two video captioning benchmarks, YouTube2Text and MSR-VTT. The results demonstrate that SibNet consistently outperforms existing methods across different evaluation metrics.
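The two branches are merged by a soft-attention step before decoding. A minimal NumPy sketch of one plausible fusion, where each branch is attended separately under the decoder state and the attended vectors are concatenated; the projection matrices and the concatenation are illustrative assumptions, not SibNet's exact decoder:

```python
import numpy as np

def soft_attention_fuse(content_feats, semantic_feats, decoder_state, W_c, W_s):
    """Attend each branch's per-frame features with weights conditioned on the
    decoder state, then concatenate the two attended vectors as the context.
    content_feats, semantic_feats: (T, D); decoder_state: (H,);
    W_c, W_s: (H, D) projection matrices (assumed learned)."""
    def attend(feats, W):
        scores = feats @ (W.T @ decoder_state)          # (T,) attention logits
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ feats                          # (D,) attended feature
    return np.concatenate([attend(content_feats, W_c), attend(semantic_feats, W_s)])

# Toy usage with random features for a 20-frame clip.
T, D, H = 20, 128, 256
ctx = soft_attention_fuse(np.random.randn(T, D), np.random.randn(T, D),
                          np.random.randn(H), np.random.randn(H, D), np.random.randn(H, D))
print(ctx.shape)   # (256,)
```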

16.
IEEE Trans Pattern Anal Mach Intell; 43(11): 3739-3753, 2021 Nov.
Article in English | MEDLINE | ID: mdl-32396073

ABSTRACT

Compared with depth-based 3D hand pose estimation, inferring 3D hand pose from monocular RGB images is more challenging due to substantial depth ambiguity and the difficulty of obtaining fully-annotated training data. Unlike existing learning-based monocular RGB-input approaches that require accurate 3D annotations for training, we propose to leverage depth images, which can be easily obtained from commodity RGB-D cameras, during training, while at test time we take only RGB inputs for 3D joint prediction. In this way, we alleviate the burden of costly 3D annotations on real-world datasets. In particular, we propose a weakly-supervised method that adapts from a fully-annotated synthetic dataset to a weakly-labeled real-world RGB dataset with the aid of a depth regularizer, which serves as weak supervision for 3D pose prediction. To further exploit the physical structure of 3D hand pose, we present a novel CVAE-based statistical framework to embed the pose-specific subspace from RGB images, which can then be used to infer 3D hand joint locations. Extensive experiments on benchmark datasets validate that our approach outperforms baselines and state-of-the-art methods, demonstrating the effectiveness of the proposed depth regularizer and the CVAE-based framework.

17.
Article in English | MEDLINE | ID: mdl-32755861

ABSTRACT

Estimating optical flow from successive video frames is one of the fundamental problems in computer vision and image processing. In the era of deep learning, many methods have been proposed to use convolutional neural networks (CNNs) for optical flow estimation in an unsupervised manner. However, the performance of unsupervised optical flow approaches is still unsatisfactory and often lags far behind their supervised counterparts, primarily due to over-smoothing across motion boundaries and occlusion. To address these issues, we propose a novel method with a new post-processing term and an effective loss function to estimate optical flow in an unsupervised, end-to-end learning manner. Specifically, we first exploit a CNN-based non-local term to refine the estimated optical flow by removing noise and decreasing blur around motion boundaries. This is implemented by automatically learning weights of dependencies over a large spatial neighborhood; because of this learning ability, the method is effective on various complicated image sequences. Secondly, to reduce the influence of occlusion, a symmetrical energy formulation is introduced to detect the occlusion map from the refined bi-directional optical flows, and the occlusion map is then integrated into the loss function. Extensive experiments are conducted on challenging datasets, i.e., FlyingChairs, MPI-Sintel and KITTI, to evaluate the performance of the proposed method. The state-of-the-art results demonstrate the effectiveness of our proposed method.
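Occlusion detection from bi-directional flow is often implemented as a forward-backward consistency check: a pixel is flagged as occluded when warping the backward flow to the first frame fails to cancel the forward flow. A minimal NumPy sketch of that standard check, used here for illustration (the paper's symmetrical energy formulation may differ in detail):

```python
import numpy as np

def occlusion_mask(flow_fw, flow_bw, alpha=0.01, beta=0.5):
    """Mark a pixel as occluded when the forward flow and the back-warped
    backward flow do not cancel out. flow_fw, flow_bw: (H, W, 2) arrays of
    (dx, dy) displacements."""
    H, W, _ = flow_fw.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Warp the backward flow to the first frame (nearest-neighbor lookup).
    tx = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, W - 1)
    ty = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, H - 1)
    flow_bw_warped = flow_bw[ty, tx]
    diff = flow_fw + flow_bw_warped
    sq_diff = np.sum(diff ** 2, axis=-1)
    bound = alpha * (np.sum(flow_fw ** 2, axis=-1)
                     + np.sum(flow_bw_warped ** 2, axis=-1)) + beta
    return sq_diff > bound   # True where the pixel is likely occluded

# Toy usage: perfectly consistent constant flow -> no occlusion detected.
fw = np.ones((64, 64, 2)); bw = -np.ones((64, 64, 2))
print(occlusion_mask(fw, bw).mean())   # fraction of pixels flagged as occluded
```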

18.
Article in English | MEDLINE | ID: mdl-32167897

ABSTRACT

Semantic segmentation for lightweight object parsing is a very challenging task, because both accuracy and efficiency (e.g., execution speed, memory footprint, or computational complexity) must be taken into account. However, most previous works focus on one side, either accuracy or speed, while neglecting the other, which greatly limits their suitability for the practical demands of intelligent devices. To tackle this dilemma, we propose a novel lightweight architecture named the Context-Integrated and Feature-Refined Network (CIFReNet). The core components of CIFReNet are the Long-skip Refinement Module (LRM) and the Multi-scale Context Integration Module (MCIM). The LRM is designed to ease the propagation of spatial information between low-level and high-level stages, and a channel attention mechanism is introduced into the long-skip learning process to boost the quality of low-level feature refinement. Meanwhile, the MCIM consists of three cascaded Dense Semantic Pyramid (DSP) blocks with image-level features, presented to encode multi-scale context information and enlarge the field of view. Specifically, the proposed DSP block exploits a dense feature sampling strategy to enhance the information representation without significantly increasing the computational cost. Comprehensive experiments are conducted on three benchmark datasets for object parsing: Cityscapes, CamVid, and Helen. The results indicate that the proposed method achieves a better trade-off between accuracy and efficiency than other state-of-the-art methods.

19.
IEEE Trans Pattern Anal Mach Intell; 42(7): 1783-1790, 2020 Jul.
Article in English | MEDLINE | ID: mdl-31251177

ABSTRACT

Nearest neighbor search is a fundamental problem in computer vision and machine learning. The straightforward solution, linear scan, is both computationally and memory intensive in large-scale, high-dimensional cases, and hence is not preferable in practice. There has therefore been considerable interest in algorithms that perform approximate nearest neighbor (ANN) search. In this paper, we propose a novel addition-based vector quantization algorithm, Asymmetric Mapping Quantization (AMQ), to conduct ANN search efficiently. Unlike existing addition-based quantization methods, which struggle with the term introduced by the norm of the database vector, we map the query vector and the database vector with different mapping functions so as to transform the L2 distance computation into an inner-product similarity, and thus do not need to evaluate the norm of the database vector. Moreover, we propose Distributed Asymmetric Mapping Quantization (DAMQ) to enable AMQ to work on very large datasets via distributed learning. Extensive experiments on approximate nearest neighbor search and image retrieval validate the merits of the proposed AMQ and DAMQ.
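The core identity behind such asymmetric mappings is ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2: since ||q||^2 is constant for a given query, ranking by an inner product between suitably mapped vectors reproduces the L2 ranking. A minimal NumPy sketch of that idea with explicit (unquantized) mappings, given purely as an illustration of the principle rather than AMQ's learned quantizer:

```python
import numpy as np

def map_database(x):
    """Append the squared norm to each database vector so that an inner
    product with a suitably mapped query reproduces the L2 ranking.
    x: (N, D) -> (N, D+1)."""
    return np.hstack([x, np.sum(x ** 2, axis=1, keepdims=True)])

def map_query(q):
    """Map the query with a different function: the inner product
    <[-2q, 1], [x, ||x||^2]> equals -2 q.x + ||x||^2, which ranks database
    vectors identically to the exact L2 distance. q: (D,) -> (D+1,)."""
    return np.hstack([-2.0 * q, 1.0])

# Toy usage: the nearest neighbor by inner product matches exact L2 search.
rng = np.random.default_rng(0)
db, q = rng.normal(size=(1000, 32)), rng.normal(size=32)
scores = map_database(db) @ map_query(q)          # lower score = closer
assert np.argmin(scores) == np.argmin(np.sum((db - q) ** 2, axis=1))
print(np.argmin(scores))
```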

20.
Article in English | MEDLINE | ID: mdl-30605101

ABSTRACT

Despite outstanding performance in image recognition, convolutional neural networks (CNNs) do not yet achieve equally impressive results on action recognition in videos. This is partially due to the inability of CNNs to model long-range temporal structures, especially those involving the individual action stages that are critical to human action recognition. In this paper, we propose a novel action-stage (ActionS) emphasized spatiotemporal Vector of Locally Aggregated Descriptors (ActionS-STVLAD) method to aggregate informative deep features across the entire video according to adaptive video feature segmentation and adaptive segment feature sampling (AVFS-ASFS). In our ActionS-STVLAD encoding approach, AVFS-ASFS chooses the key frame features and automatically splits the corresponding deep features into segments, with the features in each segment belonging to a temporally coherent ActionS. Then, based on the extracted key frame feature in each segment, a flow-guided warping technique is introduced to detect and discard redundant feature maps, while the informative ones are aggregated using our similarity weighting. Furthermore, we exploit an RGBF modality to capture motion-salient regions in the RGB images corresponding to action activity. Extensive experiments are conducted on four public benchmarks, HMDB51, UCF101, Kinetics and ActivityNet. The results show that our method effectively pools useful deep features spatiotemporally, leading to state-of-the-art performance for video-based action recognition.
