Búsqueda | Portal de Búsqueda de la BVS Colombia

Lightweight Scene Text Recognition Based on Transformer.

Luan, Xin; Zhang, Jinwei; Xu, Miaomiao; Silamu, Wushouer; Li, Yanbing.

Sensors (Basel) ; 23(9)2023 May 05.

Artículo en Inglés | MEDLINE | ID: mdl-37177694

RESUMEN

Scene text recognition (STR) has been a hot research field in computer vision, aiming to recognize text in natural scenes using computers. Currently, attention-based encoder-decoder frameworks struggle to precisely align feature regions with the target object when dealing with complex and low-quality images, a phenomenon known as attention drift. Additionally, with the rise of Transformer, the increasing size of parameters results in higher computational costs. In order to solve the above problems, based on the latest research results of Vision Transformer (ViT), we utilize an additional position-enhancement branch to alleviate attention drift and dynamically fused position information with visual information to achieve better recognition accuracy. The experimental results demonstrate that our model achieves a 3% higher average recognition accuracy on the test set compared to the baseline. Meanwhile, our model maintains the advantage of a small number of parameters and fast inference speed, achieving a good balance between accuracy, speed, and computational load.

Deep Modular Bilinear Attention Network for Visual Question Answering.

Yan, Feng; Silamu, Wushouer; Li, Yanbing.

Sensors (Basel) ; 22(3)2022 Jan 28.

Artículo en Inglés | MEDLINE | ID: mdl-35161790

RESUMEN

VQA (Visual Question Answering) is a multi-model task. Given a picture and a question related to the image, it will determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use dot-product to calculate the intra-modality and inter-modality attention between visual and language features. In this paper, the BAN (Bilinear Attention Network) method was used to calculate attention. We propose a deep multimodality bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations. The two basic attention units are the core of the whole network framework and can be cascaded in depth. In addition, we encode the question based on the dynamic word vector of BERT(Bidirectional Encoder Representations from Transformers), then use self-attention to process the question features further. Then we sum them with the features obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome datasets for augmentation, the accuracy of our model reaches 70.85% on the test-std dataset of VQA 2.0.

Asunto(s)

Lenguaje , Procesamiento de Lenguaje Natural , Proyectos de Investigación

Focal cross transformer: multi-view brain tumor segmentation model based on cross window and focal self-attention.

Zongren, Li; Silamu, Wushouer; Shurui, Feng; Guanghui, Yan.

Front Neurosci ; 17: 1192867, 2023.

Artículo en Inglés | MEDLINE | ID: mdl-37250393

RESUMEN

Introduction: Recently, the Transformer model and its variants have been a great success in terms of computer vision, and have surpassed the performance of convolutional neural networks (CNN). The key to the success of Transformer vision is the acquisition of short-term and long-term visual dependencies through self-attention mechanisms; this technology can efficiently learn global and remote semantic information interactions. However, there are certain challenges associated with the use of Transformers. The computational cost of the global self-attention mechanism increases quadratically, thus hindering the application of Transformers for high-resolution images. Methods: In view of this, this paper proposes a multi-view brain tumor segmentation model based on cross windows and focal self-attention which represents a novel mechanism to enlarge the receptive field by parallel cross windows and improve global dependence by using local fine-grained and global coarse-grained interactions. First, the receiving field is increased by parallelizing the self-attention of horizontal and vertical fringes in the cross window, thus achieving strong modeling capability while limiting the computational cost. Second, the focus on self-attention with regards to local fine-grained and global coarse-grained interactions enables the model to capture short-term and long-term visual dependencies in an efficient manner. Results: Finally, the performance of the model on Brats2021 verification set is as follows: dice Similarity Score of 87.28, 87.35 and 93.28%; Hausdorff Distance (95%) of 4.58 mm, 5.26 mm, 3.78 mm for the enhancing tumor, tumor core and whole tumor, respectively. Discussion: In summary, the model proposed in this paper has achieved excellent performance while limiting the computational cost.

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA