Results 1 - 20 of 31
1.
Article in English | MEDLINE | ID: mdl-38917290

ABSTRACT

Recently, there have been efforts to improve the performance of sign language recognition by designing self-supervised learning methods. However, these methods capture limited information from sign pose data in a frame-wise learning manner, leading to sub-optimal solutions. To this end, we propose a simple yet effective self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency from two distinct perspectives and learn instance-discriminative representations for sign language recognition. On one hand, since the semantics of sign language are expressed by the cooperation of fine-grained hands and coarse-grained trunks, we utilize information at both granularities and encode it into latent spaces. The consistency between hand and trunk features is constrained to encourage learning consistent representations of instance samples. On the other hand, inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling. Additionally, we further bridge the interaction between the embedding spaces of both modalities, facilitating bidirectional knowledge transfer to enhance sign language representation. Our method is evaluated with extensive experiments on four public benchmarks and achieves new state-of-the-art performance by a notable margin. The source code is publicly available at https://github.com/sakura/Code.
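
As a hedged illustration of the cross-granularity consistency idea described above (not the authors' released code), the sketch below contrasts hand and trunk embeddings of the same instances; the tensor names and the InfoNCE-style form are assumptions.

```python
# Hedged sketch: an InfoNCE-style consistency loss between hand and trunk
# embeddings of the same sign instances. Tensor names are hypothetical.
import torch
import torch.nn.functional as F

def hand_trunk_consistency_loss(hand_emb: torch.Tensor,
                                trunk_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """hand_emb, trunk_emb: (N, D) embeddings of the same N instances."""
    h = F.normalize(hand_emb, dim=1)
    t = F.normalize(trunk_emb, dim=1)
    logits = h @ t.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(h.size(0), device=h.device)
    # Matching hand/trunk pairs (the diagonal) are pulled together,
    # all other pairs are pushed apart, in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: 8 instances with 128-d embeddings
loss = hand_trunk_consistency_loss(torch.randn(8, 128), torch.randn(8, 128))
```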

2.
Article in English | MEDLINE | ID: mdl-38083593

ABSTRACT

Electromyography (EMG) signal-based cross-subject gesture recognition methods reduce the influence of individual differences using transfer learning technology. These methods generally require calibration data collected from new subjects to adapt the pre-trained model to those subjects. However, collecting calibration data is usually tedious and inconvenient for new subjects. This is currently a major obstacle to the daily use of hand gesture recognition based on EMG signals. To tackle the problem, we propose a novel dynamic domain generalization (DDG) method that achieves accurate recognition of the hand gestures of new subjects without any calibration data. In order to extract more robust and adaptable features, a meta-adjuster is leveraged to generate a series of template coefficients to dynamically adjust the dynamic network's parameters. Specifically, two kinds of templates are designed: the first covers different kinds of features, such as temporal, spatial, and spatial-temporal features, and the second covers different normalization layers. Meanwhile, a mix-style data augmentation method is introduced to make the meta-adjuster's training data more diversified. Experimental results on a public dataset verify that the proposed DDG outperforms the counterpart methods.
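
A minimal, hypothetical sketch of the meta-adjuster idea: a small network predicts mixing coefficients over a fixed set of feature templates. Module names, sizes, and the softmax mixing are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a meta-adjuster: a small MLP predicts softmax coefficients
# that mix K template branches (e.g., temporal / spatial / spatial-temporal
# features). Module and branch names are illustrative assumptions.
import torch
import torch.nn as nn

class MetaAdjuster(nn.Module):
    def __init__(self, in_dim: int, num_templates: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_templates))

    def forward(self, summary: torch.Tensor) -> torch.Tensor:
        # summary: (B, in_dim) statistics of the EMG window (e.g., mean features)
        return torch.softmax(self.mlp(summary), dim=1)   # (B, K) coefficients

# Mix K template feature vectors with the predicted coefficients
B, K, D = 4, 3, 256
templates = torch.randn(B, K, D)           # outputs of K feature branches
coeffs = MetaAdjuster(in_dim=D, num_templates=K)(templates.mean(dim=1))
mixed = (coeffs.unsqueeze(-1) * templates).sum(dim=1)   # (B, D) dynamic feature
```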


Subject(s)
Algorithms, Gestures, Humans, Electromyography/methods, Pattern Recognition, Automated/methods
3.
IEEE Trans Image Process ; 32: 5764-5778, 2023.
Article in English | MEDLINE | ID: mdl-37831568

ABSTRACT

Camera lenses often suffer from optical aberrations, causing radial distortion in the captured images. Such images follow a clear and general physical distortion model. However, in existing solutions, this rich geometric prior is under-utilized, and the formulation of an effective prediction target is under-explored. To this end, we introduce the Radial Distortion TRansformer (RDTR), a new framework for radial distortion rectification. Our RDTR includes a model-aware pre-training stage for distortion feature extraction and a deformation estimation stage for distortion rectification. Technically, on the one hand, we formulate the general radial distortion (i.e., barrel distortion and pincushion distortion) in camera-captured images with a shared geometric distortion model and perform unified model-aware pre-training for its learning. With this pre-training, the network is capable of encoding the specific distortion pattern of a radially distorted image. After that, we transfer the learned representations to the learning of distortion rectification. On the other hand, we introduce a new prediction target called the backward warping flow for rectifying images at any resolution while avoiding image defects. Extensive experiments are conducted on our synthetic dataset, and the results demonstrate that our method achieves state-of-the-art performance while operating in real time. Besides, we also validate the generalization of RDTR on real-world images. Our source code and the proposed dataset are publicly available at https://github.com/wwd-ustc/RDTR.
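
For context, the general physical model of radial distortion mentioned above is commonly written as a polynomial in the radial distance; the sketch below applies it with arbitrary coefficients and is not taken from the RDTR code.

```python
# Hedged sketch of the standard polynomial radial distortion model:
# r_d = r * (1 + k1*r^2 + k2*r^4).  Negative k1 gives barrel distortion,
# positive k1 gives pincushion distortion.  Coefficients are illustrative.
import numpy as np

def radially_distort(xy: np.ndarray, k1: float, k2: float = 0.0) -> np.ndarray:
    """xy: (N, 2) normalized image coordinates centered at the principal point."""
    r2 = np.sum(xy ** 2, axis=1, keepdims=True)
    factor = 1.0 + k1 * r2 + k2 * r2 ** 2
    return xy * factor

pts = np.array([[0.5, 0.5], [0.2, -0.3]])
barrel = radially_distort(pts, k1=-0.3)      # points pulled toward the center
pincushion = radially_distort(pts, k1=+0.3)  # points pushed outward
```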

4.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 13636-13652, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37467085

ABSTRACT

In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. Previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually designed mechanisms. Such heuristic designs are not only complicated but also make models easily overfit specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences with Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers with higher performance. However, the core fusion Transformer in TransVG is stand-alone from the uni-modal encoders and thus has to be trained from scratch on limited visual grounding data, which makes it hard to optimize and leads to sub-optimal performance. To this end, we further introduce TransVG++, which makes two-fold improvements. For one thing, we upgrade our framework to a purely Transformer-based one by leveraging the Vision Transformer (ViT) for vision feature encoding. For another, we devise a Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at the intermediate layers. We conduct extensive experiments on five prevalent datasets and report a series of state-of-the-art records.
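
As a rough sketch of what direct box-coordinate regression can look like (the head design here is an assumption, not TransVG's exact architecture), a small MLP maps a fused token to normalized box coordinates.

```python
# Hedged sketch of direct box-coordinate regression from a fused token:
# an MLP maps the token to normalized (cx, cy, w, h).  Sizes are illustrative.
import torch
import torch.nn as nn

reg_head = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 4))
fused_token = torch.randn(8, 256)              # output of the fusion Transformer
boxes = reg_head(fused_token).sigmoid()        # (8, 4) normalized box coordinates
```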

5.
IEEE Trans Pattern Anal Mach Intell ; 45(9): 11221-11239, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37099464

ABSTRACT

Hand gestures play a crucial role in the expression of sign language. Current deep-learning-based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resources and suffer from limited interpretability. In this paper, we propose the first self-supervised pre-trainable SignBERT+ framework with a model-aware hand prior incorporated. In our framework, the hand pose is regarded as a visual token, which is derived from an off-the-shelf detector. Each visual token is embedded with gesture state and spatial-temporal position encoding. To take full advantage of current sign data resources, we first perform self-supervised learning to model their statistics. To this end, we design multi-level masked modeling strategies (joint, frame, and clip) to mimic common failure detection cases. Jointly with these masked modeling strategies, we incorporate the model-aware hand prior to better capture hierarchical context over the sequence. After pre-training, we carefully design simple yet effective prediction heads for downstream tasks. To validate the effectiveness of our framework, we perform extensive experiments on three main SLU tasks, involving isolated and continuous sign language recognition (SLR) and sign language translation (SLT). Experimental results demonstrate the effectiveness of our method, achieving new state-of-the-art performance with a notable gain.
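
A hedged sketch of what joint-, frame-, and clip-level masking of a pose sequence might look like; the mask ratios, clip length, and array shapes are illustrative assumptions rather than the paper's settings.

```python
# Hedged sketch of multi-level masked modeling on a pose sequence of shape
# (T frames, J joints, 2 coords).  Ratios/lengths are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def mask_pose_sequence(pose: np.ndarray, joint_ratio=0.1,
                       frame_ratio=0.1, clip_len=4):
    masked = pose.copy()
    T, J, _ = pose.shape
    # joint-level: zero out random individual joints
    jm = rng.random((T, J)) < joint_ratio
    masked[jm] = 0.0
    # frame-level: zero out whole random frames
    fm = rng.random(T) < frame_ratio
    masked[fm] = 0.0
    # clip-level: zero out one contiguous clip of frames
    start = rng.integers(0, max(T - clip_len, 1))
    masked[start:start + clip_len] = 0.0
    return masked

poses = rng.normal(size=(64, 21, 2))   # 64 frames, 21 hand joints
corrupted = mask_pose_sequence(poses)  # pre-training target: recover `poses`
```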

6.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3421-3433, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35594229

ABSTRACT

In pixel-based reinforcement learning (RL), the states are raw video frames, which are mapped into a hidden representation before being fed to a policy network. To improve the sample efficiency of state representation learning, the most prominent recent work is based on contrastive unsupervised representations. Observing that consecutive video frames in a game are highly correlated, we propose a new algorithm to further improve data efficiency, masked contrastive representation learning for RL (M-CURL), which takes the correlation among consecutive inputs into consideration. In our architecture, besides a CNN encoder for the hidden representation of the input state and a policy network for action selection, we introduce an auxiliary Transformer encoder module to leverage the correlations among video frames. During training, we randomly mask the features of several frames and use the CNN encoder and Transformer to reconstruct them based on the context frames. The CNN encoder and Transformer are jointly trained via contrastive learning, where the reconstructed features should be similar to the ground-truth ones while dissimilar to others. During policy evaluation, the CNN encoder and the policy network are used to take actions, and the Transformer module is discarded. Our method achieves consistent improvements over CURL on 14 out of 16 environments from the DMControl suite and 23 out of 26 environments from the Atari 2600 games. The code is available at https://github.com/teslacool/m-curl.
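
The masked-then-contrast objective can be sketched as follows, assuming hypothetical reconstructed and ground-truth feature tensors for the masked frames; the temperature and loss form are assumptions, not the released implementation.

```python
# Hedged sketch of the masked-then-contrast idea: reconstructed features of
# masked frames should match their ground-truth features and differ from the
# features of other frames.  Shapes and the reconstruction module are assumed.
import torch
import torch.nn.functional as F

def masked_contrastive_loss(reconstructed: torch.Tensor,
                            ground_truth: torch.Tensor,
                            temperature: float = 0.1) -> torch.Tensor:
    """Both tensors: (M, D) features of the M masked frames."""
    r = F.normalize(reconstructed, dim=1)
    g = F.normalize(ground_truth, dim=1)
    logits = r @ g.t() / temperature              # (M, M)
    targets = torch.arange(r.size(0), device=r.device)
    return F.cross_entropy(logits, targets)       # positives on the diagonal

loss = masked_contrastive_loss(torch.randn(16, 64), torch.randn(16, 64))
```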

7.
IEEE Trans Pattern Anal Mach Intell ; 45(4): 5282-5295, 2023 Apr.
Article in English | MEDLINE | ID: mdl-35925851

ABSTRACT

Existing unsupervised person re-identification methods rely only on visual clues to match pedestrians under different cameras. Since visual data is inherently susceptible to occlusion, blur, clothing changes, etc., a promising solution is to introduce heterogeneous data to make up for the deficiencies of visual data. Some works based on full-scene labeling introduce wireless positioning to assist cross-domain person re-identification, but their GPS labeling of entire monitoring scenes is laborious. To this end, we propose to explore unsupervised person re-identification with both visual data and wireless positioning trajectories under weak scene labeling, in which we only need to know the locations of the cameras. Specifically, we propose a novel unsupervised multimodal training framework (UMTF), which models the complementarity of visual data and wireless information. Our UMTF contains a multimodal data association strategy (MMDA) and a multimodal graph neural network (MMGN). MMDA explores potential data associations in unlabeled multimodal data, while MMGN propagates multimodal messages in the video graph based on the adjacency matrix learned from histogram statistics of the wireless data. Thanks to the robustness of the wireless data to visual noise and the collaboration of the various modules, UMTF is capable of learning a model free of human labels on the data. Extensive experimental results conducted on two challenging datasets, i.e., WP-ReID and Campus4K, demonstrate the effectiveness of the proposed method.

8.
Front Med (Lausanne) ; 9: 1021763, 2022.
Article in English | MEDLINE | ID: mdl-36419790

ABSTRACT

With the aging of the population, the incidence of dysphagia has gradually increased and become a major clinical and public health issue. Early screening of dysphagia in high-risk populations is crucial to identify the risk factors of dysphagia and carry out effective interventions and health management in advance. In this study, the current epidemiology, hazards, risk factors, and preventive and therapeutic measures of dysphagia were comprehensively reviewed, and a literature review of screening instruments commonly used globally was conducted, focusing on their intended populations, main indicators, descriptions, and characteristics. According to the analysis in the current study, previous studies of dysphagia were predominantly conducted in inpatients, and there are few investigations and screenings on the incidence and influencing factors of dysphagia in the community-dwelling elderly or of dysphagia developing in the natural aging process. Moreover, there are no unified, simple, economical, practical, safe, and easy-to-administer screening tools and evaluation standards for dysphagia in the elderly. It is imperative to focus on dysphagia in the community-dwelling elderly, develop unified screening and assessment tools, and establish an early warning model of risks and a dietary structure model for dysphagia in the community-dwelling elderly.

9.
IEEE Trans Image Process ; 30: 6879-6891, 2021.
Article in English | MEDLINE | ID: mdl-34329164

ABSTRACT

Recent advances in video object detection have been characterized by the exploration of temporal coherence across frames to enhance object detectors. Nevertheless, previous solutions either rely on additional inputs (e.g., optical flow) to guide feature aggregation, or on complex post-processing to associate bounding boxes. In this paper, we introduce a simple but effective design that learns instance identifiers for instance association in a meta-learning paradigm, which requires no auxiliary inputs or post-processing. Specifically, we present Meta-Learnt Instance Identifier Networks (MINet) that meta-learn instance identifiers to recognize identical instances across frames in a single forward pass, leading to robust online linking of instances. Technically, depending on the detection results of previous frames, we teach MINet to learn the weights of an instance identifier on the fly, which can then be applied to upcoming frames. Such a meta-learning paradigm enables instance identifiers to be flexibly adapted to novel frames at inference. Furthermore, MINet writes/updates the detection results of previous instances into memory and reads from memory when performing inference to encourage temporal consistency for video object detection. Our MINet is appealing in the sense that it is pluggable to any object detection model. Extensive experiments on the ImageNet VID dataset demonstrate the superiority of MINet. More remarkably, by integrating MINet into Faster R-CNN, we achieve 80.2% mAP on the ImageNet VID dataset.

10.
IEEE Trans Image Process ; 30: 2220-2231, 2021.
Article in English | MEDLINE | ID: mdl-33471758

ABSTRACT

In visual tracking, how to effectively model the target appearance using limited prior information remains an open problem. In this paper, we leverage an ensemble of diverse models to learn manifold representations for robust object tracking. The proposed ensemble framework includes a shared backbone network for efficient feature extraction and multiple head networks for independent predictions. When trained on the shared data within an identical structure, the mutually correlated head models heavily limit the potential of ensemble learning. To shrink the representational overlaps among multiple models while encouraging the diversity of individual predictions, we propose model diversity and response diversity regularization terms during training. By fusing these distinctive prediction results via a fusion module, the tracking variance caused by distractor objects can be largely restrained. Our whole framework is trained end-to-end in a data-driven manner, avoiding the heuristic design of multiple base models and fusion strategies. The proposed method achieves state-of-the-art results on seven challenging benchmarks while operating in real time.
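
One plausible form of a response-diversity regularizer is sketched below, penalizing pairwise similarity between head responses; the exact regularization terms used in the paper may differ.

```python
# Hedged sketch of a response-diversity regularizer: penalize high pairwise
# cosine similarity between the response maps of the ensemble heads.
# The head outputs and the weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def response_diversity_penalty(responses: torch.Tensor) -> torch.Tensor:
    """responses: (H, N) flattened response maps from H head networks."""
    r = F.normalize(responses, dim=1)
    sim = r @ r.t()                               # (H, H) cosine similarities
    off_diag = sim - torch.eye(r.size(0), device=r.device)
    # Smaller is better: heads should not produce near-identical responses.
    return off_diag.pow(2).sum() / (r.size(0) * (r.size(0) - 1))

penalty = response_diversity_penalty(torch.randn(4, 31 * 31))
```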

11.
IEEE Trans Image Process ; 30: 2060-2071, 2021.
Article in English | MEDLINE | ID: mdl-33460378

ABSTRACT

Person re-identification is the crucial task of identifying pedestrians of interest across multiple surveillance camera views. For person re-identification, a pedestrian is usually represented with features extracted from a rectangular image region that inevitably contains the scene background, which introduces ambiguity in distinguishing different pedestrians and degrades accuracy. Thus, we propose an end-to-end foreground-aware network that discriminates the foreground from the background by learning a soft mask for person re-identification. In our method, in addition to the pedestrian ID as supervision for the foreground, we introduce the camera ID of each pedestrian image for background modeling. The foreground branch and the background branch are optimized collaboratively. By introducing a target attention loss, the pedestrian features extracted from the foreground branch become less sensitive to backgrounds, which greatly reduces the negative impact of changing backgrounds on pedestrian matching across different camera views. Notably, in contrast to existing methods, our approach does not require an additional dataset to train a human landmark detector or a segmentation model for locating the background regions. The experimental results conducted on three challenging datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17, demonstrate the effectiveness of our approach.
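
A minimal sketch of soft-mask foreground pooling, assuming a predicted mask in [0, 1]; the shapes and the pooling choice are illustrative, not the network's exact design.

```python
# Hedged sketch of soft-mask foreground pooling: a predicted mask in [0, 1]
# weights the feature map before global pooling, so background activations
# contribute little to the pedestrian descriptor.  Shapes are illustrative.
import torch

def masked_global_pool(features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """features: (B, C, H, W); mask: (B, 1, H, W) soft foreground mask."""
    weighted = features * mask
    return weighted.sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1e-6)

feats = torch.randn(2, 256, 24, 8)
soft_mask = torch.sigmoid(torch.randn(2, 1, 24, 8))
fg_descriptor = masked_global_pool(feats, soft_mask)   # (2, 256)
```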


Subject(s)
Biometric Identification/methods, Image Processing, Computer-Assisted/methods, Machine Learning, Algorithms, Humans, Pedestrians, Videotape Recording
12.
IEEE Trans Image Process ; 30: 617-627, 2021.
Article in English | MEDLINE | ID: mdl-33232230

ABSTRACT

Cross-modal retrieval aims to identify relevant data across different modalities. In this work, we focus on cross-modal retrieval between images and text sentences, which is formulated as similarity measurement for each image-text pair. To this end, we propose a Cross-modal Relation Guided Network (CRGN) to embed images and text into a latent feature space. The CRGN model uses a GRU to extract text features and a ResNet model to learn globally guided image features. Based on the global feature guiding and sentence generation learning, the relations between image regions can be modeled. The final image embedding is generated by a relation embedding module with an attention mechanism. With the image embeddings and text embeddings, we conduct cross-modal retrieval based on cosine similarity. The learned embedding space well captures the inherent relevance between images and text. We evaluate our approach with extensive experiments on two public benchmark datasets, i.e., MS-COCO and Flickr30K. Experimental results demonstrate that our approach achieves better or comparable performance to state-of-the-art methods with notable efficiency.
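
The retrieval step itself reduces to cosine-similarity ranking in the shared space; a short sketch, with hypothetical embedding tensors standing in for the CRGN outputs:

```python
# Hedged sketch of the retrieval step: rank text sentences for each image by
# cosine similarity in the shared embedding space.  Embeddings are assumed to
# come from the (hypothetical) image and text branches.
import torch
import torch.nn.functional as F

def rank_by_cosine(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """image_emb: (Ni, D); text_emb: (Nt, D).  Returns (Ni, Nt) ranking indices."""
    sim = F.normalize(image_emb, dim=1) @ F.normalize(text_emb, dim=1).t()
    return sim.argsort(dim=1, descending=True)   # best-matching texts first

ranking = rank_by_cosine(torch.randn(5, 512), torch.randn(100, 512))
```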

13.
Article in English | MEDLINE | ID: mdl-32356748

ABSTRACT

Correlation filters (CF) have received considerable attention in visual tracking because of their computational efficiency. Leveraging deep features from off-the-shelf CNN models (e.g., VGG), CF trackers achieve state-of-the-art performance while consuming a large amount of computing resources. This prevents deep CF trackers from being deployed on many mobile platforms on which only a single-core CPU is available. In this paper, we propose to jointly compress and transfer off-the-shelf CNN models within a knowledge distillation framework. We formulate a CNN model pre-trained on the image classification task as a teacher network, and distill this teacher network into a lightweight student network that serves as the feature extractor to speed up CF trackers. In the distillation process, we propose a fidelity loss to enable the student network to maintain the representation capability of the teacher network. Meanwhile, we design a tracking loss to adapt the objective of the student network from object recognition to visual tracking. The distillation is performed offline on multiple layers, and the student network is adaptively updated using a background-aware online learning scheme. The online adaptation stage exploits the background contents to improve the feature discrimination of the student network. Extensive experiments on six standard datasets demonstrate that the lightweight student network accelerates state-of-the-art deep CF trackers to real-time speed on a single-core CPU while maintaining almost the same tracking accuracy.
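
A hedged sketch of the two-part objective described above, combining a fidelity term and a tracking term; the specific loss functions and the weighting are assumptions.

```python
# Hedged sketch of the two-part distillation objective: a fidelity term keeps
# the student features close to the teacher's, while a tracking term fits the
# predicted response to a target response map.  Losses and weights are assumed.
import torch
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_feat,
                      predicted_response, target_response, alpha=0.5):
    fidelity = F.mse_loss(student_feat, teacher_feat)          # keep representation
    tracking = F.mse_loss(predicted_response, target_response) # adapt to tracking
    return fidelity + alpha * tracking

loss = distillation_loss(torch.randn(1, 128, 14, 14), torch.randn(1, 128, 14, 14),
                         torch.randn(1, 1, 14, 14), torch.randn(1, 1, 14, 14))
```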

14.
Proc Natl Acad Sci U S A ; 116(47): 23850-23858, 2019 11 19.
Article in English | MEDLINE | ID: mdl-31685622

ABSTRACT

Increasing maize grain yield has been a major focus of both plant breeding and genetic engineering to meet the global demand for food, feed, and industrial uses. We report that increasing and extending expression of a maize MADS-box transcription factor gene, zmm28, under the control of a moderate-constitutive maize promoter, results in maize plants with increased plant growth, photosynthesis capacity, and nitrogen utilization. Molecular and biochemical characterization of zmm28 transgenic plants demonstrated that their enhanced agronomic traits are associated with elevated plant carbon assimilation, nitrogen utilization, and plant growth. Overall, these positive attributes are associated with a significant increase in grain yield relative to wild-type controls that is consistent across years, environments, and elite germplasm backgrounds.


Subject(s)
Crops, Agricultural/genetics, Edible Grain, Genes, Plant, Zea mays/genetics, Amino Acid Sequence, Crops, Agricultural/enzymology, Glutamate-Ammonia Ligase/metabolism, Nitrate Reductase/metabolism, Nitrogen/metabolism, Photosynthesis/genetics, Plant Leaves/physiology, Plant Proteins/chemistry, Plant Proteins/genetics, Plant Proteins/metabolism, Plants, Genetically Modified, Protein Binding, Transcriptome, Zea mays/enzymology
15.
Article in English | MEDLINE | ID: mdl-31545723

ABSTRACT

Vision-based sign language translation (SLT) is a challenging task due to the complicated variations of facial expressions, gestures, and articulated poses involved in sign linguistics. As a weakly supervised sequence-to-sequence learning problem, SLT usually provides no exact temporal boundaries of actions. To adequately explore temporal hints in videos, we propose a novel framework named Hierarchical deep Recurrent Fusion (HRF). Aiming at modeling discriminative action patterns, in HRF we design an adaptive temporal encoder to capture crucial RGB visemes and skeleton signees. Specifically, RGB visemes and skeleton signees are each learned by the same scheme, named Adaptive Clip Summarization (ACS). ACS consists of three key modules, i.e., variable-length clip mining, adaptive temporal pooling, and attention-aware weighting. Besides, based on the unaligned action patterns (RGB visemes and skeleton signees), a query-adaptive decoding fusion is proposed to translate the target sentence. Extensive experiments demonstrate the effectiveness of the proposed HRF framework.

16.
Article in English | MEDLINE | ID: mdl-30106728

ABSTRACT

Image retrieval has achieved remarkable improvements with the rapid progress of visual representation and indexing techniques. Given a query image, search engines are expected to retrieve relevant results, of which the top-ranked short list is of most value to users. However, it is challenging to measure retrieval quality on the fly without direct user feedback. In this paper, we aim at evaluating the quality of retrieval results at first glance (i.e., with the top-ranked images). For each retrieval result, we compute a correlation-based feature matrix that comprises contextual information from the retrieval list, and then feed it into a convolutional neural network regression model for retrieval quality evaluation. In this proposed framework, multiple visual features are integrated together for robust representations. We optimize the output of this simple yet effective evaluation method to be consistent with the Discounted Cumulative Gain (DCG), the intuitive measure of the quality of the top-ranked results. We evaluate our method in terms of prediction accuracy and consistency with the ground truth, and demonstrate its practicability in applications such as rank list selection and database image abundance analyses.
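
For reference, the DCG target the regression model is fitted to can be computed as follows (the simple gain form is used here; the paper may use a different variant):

```python
# Hedged sketch of Discounted Cumulative Gain over the top-ranked results,
# the target the regression model is trained to approximate.
import math

def dcg(relevances):
    """relevances: graded relevance of results in ranked order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

# Example: top-5 list with graded relevance labels
print(dcg([3, 2, 3, 0, 1]))   # higher when relevant results are ranked earlier
```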

17.
IEEE Trans Image Process ; 27(10): 4945-4957, 2018 Oct.
Article in English | MEDLINE | ID: mdl-29985135

ABSTRACT

Deep convolutional neural networks (CNNs) have been widely and successfully applied to many computer vision tasks, such as classification, detection, and semantic segmentation. As for image retrieval, while off-the-shelf CNN features from models trained for the classification task have been demonstrated to be promising, it remains a challenge to learn specific features oriented toward instance retrieval. Witnessing the great success of the low-level SIFT feature in image retrieval and its complementary nature to the semantic-aware CNN feature, in this paper we propose to embed the SIFT feature into the CNN feature with a Siamese structure in a learning-based paradigm. The learning objective consists of two kinds of loss, i.e., a similarity loss and a fidelity loss. The first loss embeds the image-level nearest-neighborhood structure of the SIFT feature into CNN feature learning, while the second loss imposes that the CNN feature from the updated CNN model preserves the fidelity of that from the original CNN model solely trained for classification. After the learning, the generated CNN feature inherits the property of the SIFT feature, which is well suited for image retrieval. We evaluate our approach on public datasets, and comprehensive experiments demonstrate the effectiveness of the proposed method.
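
A hedged sketch of the two losses, assuming a contrastive-style similarity term over SIFT-neighbor pairs and an L2 fidelity term; the pair construction, margin, and weighting are illustrative assumptions.

```python
# Hedged sketch: a similarity term pulls together CNN features of image pairs
# that are nearest neighbors under SIFT, and a fidelity term keeps the new
# features close to those of the frozen classification model.
import torch
import torch.nn.functional as F

def embedding_loss(feat_a, feat_b, feat_a_orig, is_sift_neighbor,
                   margin=0.5, beta=1.0):
    """feat_a, feat_b: (B, D) pair features from the updated model;
    feat_a_orig: (B, D) features of the same images from the frozen model;
    is_sift_neighbor: (B,) float mask, 1 if the pair are SIFT nearest neighbors."""
    d = F.pairwise_distance(feat_a, feat_b)
    similarity = (is_sift_neighbor * d.pow(2) +
                  (1 - is_sift_neighbor) * F.relu(margin - d).pow(2)).mean()
    fidelity = F.mse_loss(feat_a, feat_a_orig)
    return similarity + beta * fidelity

loss = embedding_loss(torch.randn(8, 256), torch.randn(8, 256),
                      torch.randn(8, 256), torch.randint(0, 2, (8,)).float())
```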

18.
IEEE Trans Pattern Anal Mach Intell ; 40(5): 1154-1166, 2018 05.
Article in English | MEDLINE | ID: mdl-28278457

ABSTRACT

In content-based image retrieval, the SIFT feature and the feature from a deep convolutional neural network (CNN) have demonstrated promising performance. To fully explore both visual features in a unified framework for effective and efficient retrieval, we propose a collaborative index embedding method to implicitly integrate their index matrices. We formulate the index embedding as an optimization problem from the perspective of neighborhood sharing and solve it with an alternating index update scheme. After the iterative embedding, only the embedded CNN index is kept for online queries, which yields a significant gain in retrieval accuracy with a very economical memory cost. Extensive experiments have been conducted on public datasets with million-scale distractor images. The experimental results reveal that, compared with recent state-of-the-art retrieval algorithms, our approach achieves competitive accuracy with less memory overhead and efficient query computation.

19.
IEEE Trans Image Process ; 25(5): 2311-23, 2016 May.
Article in English | MEDLINE | ID: mdl-26955030

ABSTRACT

As the unique identification of a vehicle, the license plate is a key clue to uncovering speeding vehicles or those involved in hit-and-run accidents. However, the snapshot of a speeding vehicle captured by a surveillance camera is frequently blurred due to fast motion, and may even be unrecognizable by humans. Such plate images are usually of low resolution and suffer severe loss of edge information, which poses a great challenge to existing blind deblurring methods. For license plate image blurring caused by fast motion, the blur kernel can be viewed as a linear uniform convolution and parametrically modeled with an angle and a length. In this paper, we propose a novel scheme based on sparse representation to identify the blur kernel. By analyzing the sparse representation coefficients of the recovered image, we determine the angle of the kernel based on the observation that the recovered image has the sparsest representation when the kernel angle corresponds to the genuine motion angle. Then, we estimate the length of the motion kernel with the Radon transform in the Fourier domain. Our scheme can handle large motion blur well, even when the license plate is unrecognizable by humans. We evaluate our approach on real-world images and compare it with several popular state-of-the-art blind image deblurring algorithms. Experimental results demonstrate the superiority of our proposed approach in terms of effectiveness and robustness.
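
The parametric blur model referred to above (a linear uniform kernel defined by an angle and a length) can be constructed as in the sketch below; the estimation of the angle and length themselves is not shown.

```python
# Hedged sketch of the parametric blur model: a linear uniform motion kernel
# fully described by an angle and a length.  Values are illustrative only.
import numpy as np

def linear_motion_kernel(length: int, angle_deg: float) -> np.ndarray:
    size = length if length % 2 == 1 else length + 1
    kernel = np.zeros((size, size))
    center = size // 2
    theta = np.deg2rad(angle_deg)
    # Rasterize a line segment of the given length through the kernel center.
    for t in np.linspace(-length / 2, length / 2, 2 * length + 1):
        x = int(round(center + t * np.cos(theta)))
        y = int(round(center - t * np.sin(theta)))
        if 0 <= x < size and 0 <= y < size:
            kernel[y, x] = 1.0
    return kernel / kernel.sum()

k = linear_motion_kernel(length=15, angle_deg=30)   # convolve with the sharp plate
```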

20.
IEEE Trans Pattern Anal Mach Intell ; 38(1): 159-71, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26656584

ABSTRACT

In this paper, we investigate the problem of scalable visual feature matching in large-scale image search and propose a novel cascaded scalar quantization scheme in dual resolution. We formulate visual feature matching as a range-based neighbor search problem and approach it by identifying hyper-cubes with a dual-resolution scalar quantization strategy. Specifically, for each dimension of the PCA-transformed feature, scalar quantization is performed at both coarse and fine resolutions. The scalar quantization results at the coarse resolution are cascaded over multiple dimensions to index an image database. The scalar quantization results over multiple dimensions at the fine resolution are concatenated into a binary super-vector and stored in the index list for efficient verification. The proposed cascaded scalar quantization (CSQ) method is free of costly visual codebook training and is thus independent of any image descriptor training set. The index structure of CSQ is flexible enough to accommodate new image features and scalable enough to index large-scale image databases. We evaluate our approach on public benchmark datasets for large-scale image retrieval. Experimental results demonstrate the competitive retrieval performance of the proposed method compared with several recent retrieval algorithms on feature quantization.
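
A hedged sketch of dual-resolution scalar quantization on a PCA-transformed descriptor; the bin counts, the number of cascaded dimensions, and the key construction are illustrative assumptions.

```python
# Hedged sketch of dual-resolution scalar quantization: coarse codes over the
# first dimensions form a cascaded index key, fine per-dimension codes form a
# binary super-vector used for verification.  Settings are illustrative.
import numpy as np

def cascaded_scalar_quantize(feat, coarse_dims=4, coarse_bins=4, fine_bins=256):
    """feat: 1-D PCA-transformed descriptor with values roughly in [-1, 1]."""
    edges_c = np.linspace(-1, 1, coarse_bins + 1)[1:-1]
    edges_f = np.linspace(-1, 1, fine_bins + 1)[1:-1]
    coarse = np.digitize(feat[:coarse_dims], edges_c)     # cascaded index key digits
    fine = np.digitize(feat, edges_f).astype(np.uint8)    # super-vector bytes
    key = int("".join(str(c) for c in coarse), base=coarse_bins)
    return key, fine.tobytes()

key, code = cascaded_scalar_quantize(np.clip(np.random.randn(32) * 0.3, -1, 1))
```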
