RESUMEN
In the realm of hyperspectral image classification, the pursuit of heightened accuracy and comprehensive feature extraction has led to the formulation of an advance architectural paradigm. This study proposed a model encapsulated within the framework of a unified model, which synergistically leverages the capabilities of three distinct branches: the swin transformer, convolutional neural network, and encoder-decoder. The main objective was to facilitate multiscale feature learning, a pivotal facet in hyperspectral image classification, with each branch specializing in unique facets of multiscale feature extraction. The swin transformer, recognized for its competence in distilling long-range dependencies, captures structural features across different scales; simultaneously, convolutional neural networks undertake localized feature extraction, engendering nuanced spatial information preservation. The encoder-decoder branch undertakes comprehensive analysis and reconstruction, fostering the assimilation of both multiscale spectral and spatial intricacies. To evaluate our approach, we conducted experiments on publicly available datasets and compared the results with state-of-the-art methods. Our proposed model obtains the best classification result compared to others. Specifically, overall accuracies of 96.87%, 98.48%, and 98.62% were obtained on the Xuzhou, Salinas, and LK datasets.
RESUMEN
Hyperspectral image (HSI) data has a wide range of valuable spectral information for numerous tasks. HSI data encounters challenges such as small training samples, scarcity, and redundant information. Researchers have introduced various research works to address these challenges. Convolution Neural Network (CNN) has gained significant success in the field of HSI classification. CNN's primary focus is to extract low-level features from HSI data, and it has a limited ability to detect long-range dependencies due to the confined filter size. In contrast, vision transformers exhibit great success in the HSI classification field due to the use of attention mechanisms to learn the long-range dependencies. As mentioned earlier, the primary issue with these models is that they require sufficient labeled training data. To address this challenge, we proposed a spectral-spatial feature extractor group attention transformer that consists of a multiscale feature extractor to extract low-level or shallow features. For high-level semantic feature extraction, we proposed a group attention mechanism. Our proposed model is evaluated using four publicly available HSI datasets, which are Indian Pines, Pavia University, Salinas, and the KSC dataset. Our proposed approach achieved the best classification results in terms of overall accuracy (OA), average accuracy (AA), and Kappa coefficient. As mentioned earlier, the proposed approach utilized only 5%, 1%, 1%, and 10% of the training samples from the publicly available four datasets.