Results 1 - 10 of 10
1.
Sensors (Basel) ; 23(20)2023 Oct 12.
Article in English | MEDLINE | ID: mdl-37896503

ABSTRACT

Unsupervised domain adaptation (UDA) aims to mitigate the performance drop due to the distribution shift between the training and testing datasets. UDA methods have achieved performance gains when models trained on a source domain with labeled data are adapted to a target domain with only unlabeled data. The standard feature extraction method in domain adaptation has been convolutional neural networks (CNNs). Recently, attention-based transformer models have emerged as effective alternatives for computer vision tasks. In this paper, we benchmark three attention-based architectures, specifically vision transformer (ViT), shifted window transformer (SWIN), and dual attention vision transformer (DAViT), against convolutional architectures ResNet, HRNet and attention-based ConvNext, to assess the performance of different backbones for domain generalization and adaptation. We incorporate these backbone architectures as feature extractors in the source hypothesis transfer (SHOT) framework for UDA. SHOT leverages the knowledge learned in the source domain to align the image features of unlabeled target data in the absence of source domain data, using self-supervised deep feature clustering and self-training. We analyze the generalization and adaptation performance of these models on standard UDA datasets and aerial UDA datasets. In addition, we modernize the training procedure commonly seen in UDA tasks by adding image augmentation techniques to help models generate richer features. Our results show that ConvNext and SWIN offer the best performance, indicating that the attention mechanism is very beneficial for domain generalization and adaptation with both transformer and convolutional architectures. Our ablation study shows that our modernized training recipe, within the SHOT framework, significantly boosts performance on aerial datasets.
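The abstract describes SHOT as relying on self-supervised deep feature clustering and self-training but gives no formulas. The following is a minimal sketch of one round of prediction-weighted centroid clustering in that spirit, assuming target features and classifier logits are already available. The function names (`softmax`, `cosine_similarity`, `pseudo_labels`) and the toy data are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def pseudo_labels(features, probs):
    """Assign each target feature to its nearest prediction-weighted centroid.

    Assumption: one clustering round only; a real pipeline would likely
    iterate and then fine-tune the feature extractor on these labels.
    """
    num_classes = len(probs[0])
    dim = len(features[0])
    centroids = []
    for k in range(num_classes):
        # Each sample contributes to centroid k in proportion to p(class k).
        weight = sum(p[k] for p in probs) + 1e-12
        centroids.append([
            sum(p[k] * f[d] for p, f in zip(probs, features)) / weight
            for d in range(dim)
        ])
    return [
        max(range(num_classes), key=lambda k: cosine_similarity(f, centroids[k]))
        for f in features
    ]

# Toy target batch: two clusters in feature space. The last sample's raw
# prediction slightly favors class 0, although it lies in the class-1 cluster.
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
probs = [softmax(l) for l in [[2.0, 0.0], [0.2, 0.0], [0.0, 2.0], [0.1, 0.0]]]
labels = pseudo_labels(features, probs)
```

In this toy batch the centroid assignment places the last sample with its feature-space neighbors rather than with its raw prediction, which is the kind of correction clustering-based self-training is meant to provide.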

2.
Sensors (Basel) ; 23(7)2023 Apr 04.
Article in English | MEDLINE | ID: mdl-37050785

ABSTRACT

We present Full-BAPose, a novel bottom-up approach for full body pose estimation that achieves state-of-the-art results without relying on external people detectors. The Full-BAPose method addresses the broader task of full body pose estimation including hands, feet, and facial landmarks. Our deep learning architecture is end-to-end trainable based on an encoder-decoder configuration with HRNet backbone and multi-scale representations using a disentangled waterfall atrous spatial pooling module. The disentangled waterfall module leverages the efficiency of progressive filtering, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Additionally, it combines multi-scale features obtained from the waterfall flow with the person-detection capability of the disentangled adaptive regression and incorporates adaptive convolutions to infer keypoints more precisely in crowded scenes. Full-BAPose achieves state-of-the-art performance on the challenging CrowdPose and COCO-WholeBody datasets, with AP of 72.2% and 68.4%, respectively, based on 133 keypoints. Our results demonstrate that Full-BAPose is efficient and robust when operating under a variety of conditions, including multiple people, changes in scale, and occlusions.

3.
Cognition ; 224: 105040, 2022 07.
Article in English | MEDLINE | ID: mdl-35192994

ABSTRACT

If language has evolved for communication, languages should be structured such that they maximize the efficiency of processing. What is efficient for communication in the visual-gestural modality is different from the auditory-oral modality, and we ask here whether sign languages have adapted to the affordances and constraints of the signed modality. During sign perception, perceivers look almost exclusively at the lower face, rarely looking down at the hands. This means that signs articulated far from the lower face must be perceived through peripheral vision, which has less acuity than central vision. We tested the hypothesis that signs that are more predictable (high frequency signs, signs with common handshapes) can be produced further from the face because precise visual resolution is not necessary for recognition. Using pose estimation algorithms, we examined the structure of over 2000 American Sign Language lexical signs to identify whether lexical frequency and handshape probability affect the position of the wrist in 2D space. We found that frequent signs with rare handshapes tended to occur closer to the signer's face than frequent signs with common handshapes, and that frequent signs are generally more likely to be articulated further from the face than infrequent signs. Together these results provide empirical support for anecdotal assertions that the phonological structure of sign language is shaped by the properties of the human visual and motor systems.


Subject(s)
Language , Sign Language , Gestures , Humans , Recognition, Psychology , Visual Perception
4.
IEEE Trans Circuits Syst Video Technol ; 32(10): 6642-6656, 2022 Oct.
Article in English | MEDLINE | ID: mdl-37215187

ABSTRACT

Video captioning is a challenging task as it needs to accurately transform visual understanding into natural language description. To date, state-of-the-art methods inadequately model global-local vision representation for sentence generation, leaving plenty of room for improvement. In this work, we approach the video captioning task from a new perspective and propose a GLR framework, namely a global-local representation granularity. Our GLR demonstrates three advantages over the prior efforts. First, we propose a simple solution, which exploits extensive vision representations from different video ranges to improve linguistic expression. Second, we devise a novel global-local encoder, which encodes different video representations including long-range, short-range and local-keyframe, to produce rich semantic vocabulary for obtaining a descriptive granularity of video contents across frames. Finally, we introduce a progressive training strategy that can effectively organize feature learning to yield optimal captioning behavior. Evaluated on the MSR-VTT and MSVD datasets, we outperform recent state-of-the-art methods including a well-tuned SA-LSTM baseline by a significant margin, with shorter training schedules. Because of its simplicity and efficacy, we hope that our GLR could serve as a strong baseline for many video understanding tasks besides video captioning. Code will be available.

5.
IEEE Trans Pattern Anal Mach Intell ; 44(12): 9641-9653, 2022 12.
Article in English | MEDLINE | ID: mdl-34727028

ABSTRACT

We propose UniPose+, a unified framework for 2D and 3D human pose estimation in images and videos. The UniPose+ architecture leverages multi-scale feature representations to increase the effectiveness of backbone feature extractors, with no significant increase in network size and no postprocessing. Current pose estimation methods heavily rely on statistical postprocessing or predefined anchor poses for joint localization. The UniPose+ framework incorporates contextual information across scales and joint localization with Gaussian heatmap modulation at the decoder output to estimate 2D and 3D human pose in a single stage with state-of-the-art accuracy, without relying on predefined anchor poses. The multi-scale representations allowed by the waterfall module in the UniPose+ framework leverage the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Our results on multiple datasets demonstrate that UniPose+, with an HRNet, ResNet or SENet backbone and waterfall module, is a robust and efficient architecture for single-person 2D and 3D pose estimation in single images and videos.


Subject(s)
Algorithms , Imaging, Three-Dimensional , Humans , Imaging, Three-Dimensional/methods
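
The UniPose+ abstract refers to joint localization with Gaussian heatmaps at the decoder output. As a generic illustration only, and not the UniPose+ implementation, the sketch below builds a 2D Gaussian heatmap around a joint position and recovers the joint by taking the highest-scoring cell. The function names and the `sigma` value are assumptions chosen for the example.

```python
import math

def gaussian_heatmap(height, width, cx, cy, sigma=1.5):
    """Return a height x width grid of values peaking at column cx, row cy."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def decode_keypoint(heatmap):
    """Return (x, y, score) of the highest-valued cell in the heatmap."""
    best = (0, 0, heatmap[0][0])
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best[2]:
                best = (x, y, value)
    return best

# Hypothetical joint at column 6, row 3 of an 8 x 10 heatmap.
heatmap = gaussian_heatmap(8, 10, cx=6, cy=3)
keypoint = decode_keypoint(heatmap)
```

A network trained against such targets predicts one heatmap per joint, so decoding a full pose amounts to repeating `decode_keypoint` once per channel.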
6.
Sensors (Basel) ; 21(23)2021 Dec 02.
Article in English | MEDLINE | ID: mdl-34884072

ABSTRACT

Deep learning grew in importance in recent years due to its versatility and excellent performance on supervised classification tasks. A core assumption for such supervised approaches is that the training and testing data are drawn from the same underlying data distribution. This may not always be the case, and in such cases, the performance of the model is degraded. Domain adaptation aims to overcome the domain shift between the source domain used for training and the target domain data used for testing. Unsupervised domain adaptation deals with situations where the network is trained on labeled data from the source domain and unlabeled data from the target domain with the goal of performing well on the target domain data at the time of deployment. In this study, we overview seven state-of-the-art unsupervised domain adaptation models based on deep learning and benchmark their performance on three new domain adaptation datasets created from publicly available aerial datasets. We believe this is the first study on benchmarking domain adaptation methods for aerial data. In addition to reporting classification performance for the different domain adaptation models, we present t-SNE visualizations that illustrate the benefits of the adaptation process.


Subject(s)
Adaptation, Physiological , Benchmarking
7.
Sensors (Basel) ; 21(22)2021 Nov 11.
Article in English | MEDLINE | ID: mdl-34833577

ABSTRACT

We propose GourmetNet, a single-pass, end-to-end trainable network for food segmentation that achieves state-of-the-art performance. Food segmentation is an important problem as the first step for nutrition monitoring, food volume and calorie estimation. Our novel architecture incorporates both channel attention and spatial attention information in an expanded multi-scale feature representation using our advanced Waterfall Atrous Spatial Pooling module. GourmetNet refines the feature extraction process by merging features from multiple levels of the backbone through the two attention modules. The refined features are processed with the advanced multi-scale waterfall module that combines the benefits of cascade filtering and pyramid representations without requiring a separate decoder or post-processing. Our experiments on two food datasets show that GourmetNet significantly outperforms existing state-of-the-art methods.


Subject(s)
Image Processing, Computer-Assisted , Neural Networks, Computer , Attention , Food
8.
Sensors (Basel) ; 20(2)2020 Jan 19.
Article in English | MEDLINE | ID: mdl-31963879

ABSTRACT

In recent years, deep learning-based visual object trackers have achieved state-of-the-art performance on several visual object tracking benchmarks. However, most tracking benchmarks are focused on ground level videos, whereas aerial tracking presents a new set of challenges. In this paper, we compare ten trackers based on deep learning techniques on four aerial datasets. We choose top performing trackers utilizing different approaches, specifically tracking by detection, discriminative correlation filters, Siamese networks and reinforcement learning. In our experiments, we use a subset of the OTB2015 dataset with aerial style videos; the UAV123 dataset without synthetic sequences; the UAV20L dataset, which contains 20 long sequences; and the DTB70 dataset as our benchmark datasets. We compare the advantages and disadvantages of different trackers in different tracking situations encountered in aerial data. Our findings indicate that the trackers perform significantly worse on aerial datasets than on standard ground level videos. We attribute this effect to smaller target size, camera motion, significant camera rotation with respect to the target, out of view movement, and clutter in the form of occlusions or similar looking distractors near the tracked object.

9.
Sensors (Basel) ; 19(24)2019 Dec 05.
Article in English | MEDLINE | ID: mdl-31817366

ABSTRACT

We propose a new efficient architecture for semantic segmentation, based on a "Waterfall" Atrous Spatial Pooling architecture, that achieves a considerable accuracy increase while decreasing the number of network parameters and memory footprint. The proposed Waterfall architecture leverages the efficiency of progressive filtering in the cascade architecture while maintaining multiscale fields-of-view comparable to spatial pyramid configurations. Additionally, our method does not rely on a postprocessing stage with Conditional Random Fields, which further reduces complexity and required training time. We demonstrate that the Waterfall approach with a ResNet backbone is a robust and efficient architecture for semantic segmentation, obtaining state-of-the-art results with a significant reduction in the number of parameters for the Pascal VOC dataset and the Cityscapes dataset.
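
To make the contrast between cascade ("waterfall") filtering and a single atrous branch concrete, here is a small 1D sketch. It assumes a 3-tap kernel of ones and dilation rates 1, 2 and 4, all chosen purely for illustration; the abstract does not state the rates or kernels used in the paper, and the helper names are hypothetical. Each branch filters the previous branch's output, and every branch output is kept so that several fields-of-view are available at once.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-size 1D filtering with zero padding and the given dilation."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - half) * dilation
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out

def waterfall(signal, kernel, dilations):
    """Cascade: each branch filters the previous output; all outputs are kept."""
    outputs = []
    current = signal
    for d in dilations:
        current = dilated_conv1d(current, kernel, d)
        outputs.append(current)
    return outputs

def support_width(values):
    """Span between the first and last nonzero entries (0 if all are zero)."""
    nonzero = [i for i, v in enumerate(values) if v != 0.0]
    return nonzero[-1] - nonzero[0] + 1 if nonzero else 0

# Impulse response: how far a single input sample spreads in each branch.
impulse = [0.0] * 31
impulse[15] = 1.0
branch_widths = [support_width(o) for o in waterfall(impulse, [1.0, 1.0, 1.0], [1, 2, 4])]
single_branch_width = support_width(dilated_conv1d(impulse, [1.0, 1.0, 1.0], 4))
```

In this toy setting the cascade reuses earlier outputs, so the third branch covers a wider span than a lone dilation-4 filter applied to the input, while the earlier, narrower branch outputs remain available for multiscale fusion.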

10.
IEEE Trans Image Process ; 23(4): 1737-50, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24808343

ABSTRACT

The parsimonious nature of sparse representations has been successfully exploited for the development of highly accurate classifiers for various scientific applications. Despite the successes of sparse representation techniques, a large number of dictionary atoms as well as the high dimensionality of the data can make these classifiers computationally demanding. Furthermore, sparse classifiers are subject to the adverse effects of a phenomenon known as coefficient contamination, where, for example, variations in pose may affect identity and expression recognition. We analyze the interaction between dimensionality reduction and sparse representations, and propose a technique, called Linear extension of Graph Embedding K-means-based Singular Value Decomposition (LGE-KSVD) to address both issues of computational intensity and coefficient contamination. In particular, the LGE-KSVD utilizes variants of the LGE to optimize the K-SVD, an iterative technique for small yet overcomplete dictionary learning. The dimensionality reduction matrix, sparse representation dictionary, sparse coefficients, and sparsity-based classifier are jointly learned through the LGE-KSVD. The atom optimization process is redefined to allow variable support using graph embedding techniques and produce a more flexible and elegant dictionary learning algorithm. Results are presented on a wide variety of facial and activity recognition problems that demonstrate the robustness of the proposed method.


Subject(s)
Face/anatomy & histology , Facial Expression , Image Processing, Computer-Assisted/methods , Pattern Recognition, Automated/methods , Algorithms , Artificial Intelligence , Databases, Factual , Female , Humans , Male
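
The last abstract builds on representing a signal as a sparse combination of dictionary atoms. Dictionary-learning methods such as K-SVD alternate a sparse-coding step with a dictionary update. The sketch below illustrates only a generic sparse-coding step, using plain matching pursuit as a deliberately simple stand-in; it is not LGE-KSVD and involves no dimensionality reduction. The dictionary, signal and function name are hypothetical, and atoms are assumed to have unit norm.

```python
import math

def matching_pursuit(signal, dictionary, num_atoms):
    """Greedy sparse coding over unit-norm atoms.

    Returns (coeffs, residual), where coeffs maps atom index -> coefficient.
    """
    residual = list(signal)
    coeffs = {}
    for _ in range(num_atoms):
        # Correlate the current residual with every atom.
        scores = [sum(r * a for r, a in zip(residual, atom)) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda k: abs(scores[k]))
        if abs(scores[best]) < 1e-12:
            break  # nothing left to explain
        coeffs[best] = coeffs.get(best, 0.0) + scores[best]
        # Remove the selected atom's contribution from the residual.
        residual = [r - scores[best] * a for r, a in zip(residual, dictionary[best])]
    return coeffs, residual

# Overcomplete toy dictionary: three axis atoms plus one diagonal atom in 3D.
s = 1.0 / math.sqrt(2.0)
dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [s, s, 0.0]]
signal = [2.0 * s, 2.0 * s, 0.5]  # 2 * atom 3 + 0.5 * atom 2
coeffs, residual = matching_pursuit(signal, dictionary, num_atoms=2)
```

With an overcomplete dictionary the same signal could also be written with atoms 0, 1 and 2, but the greedy search finds the two-atom code, which is the parsimony the abstract refers to.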