ABSTRACT
Using visual place recognition (VPR) technology to ascertain the geographical location of publicly available images is an important task. Although most current VPR methods achieve favorable results under ideal conditions, their performance in complex environments characterized by lighting variations, seasonal changes, and occlusions is generally unsatisfactory. Obtaining efficient and robust image feature descriptors in such environments therefore remains a pressing issue. In this study, we used the DINOv2 model as the backbone, trimming and fine-tuning it to extract robust image features, and employed a feature-mix module to aggregate these features into globally robust and generalizable descriptors that enable high-precision VPR. Experiments demonstrate that the proposed DINO-Mix outperforms current state-of-the-art (SOTA) methods. On test sets with lighting variations, seasonal changes, and occlusions, such as Tokyo24/7, Nordland, and SF-XL-Testv1, the proposed architecture achieved Top-1 accuracy rates of 91.75%, 80.18%, and 82%, respectively, an average accuracy improvement of 5.14%. In addition, we compared DINO-Mix with other SOTA methods in representative image-retrieval case studies, and our architecture outperformed its competitors in VPR performance. Furthermore, we visualized the attention maps of DINO-Mix and other methods to provide a more intuitive understanding of their respective strengths; these visualizations offer compelling evidence of the superiority of the DINO-Mix framework in this domain.
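The pipeline described above (backbone features, mix-style aggregation, Top-1 retrieval by descriptor similarity) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token/descriptor sizes and the random weight matrices are hypothetical stand-ins for a trimmed, fine-tuned DINOv2 backbone and the trained feature-mix module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: N patch tokens of dimension D per image are mixed
# into a K-dimensional global descriptor (the paper's exact sizes differ).
N, D, K = 16, 64, 128
W_token = rng.standard_normal((N, N)) / np.sqrt(N)        # token-mixing weights
W_out = rng.standard_normal((N * D, K)) / np.sqrt(N * D)  # output projection

def feature_mix(tokens):
    """Aggregate (N, D) patch tokens into an L2-normalized K-dim descriptor."""
    mixed = tokens + W_token @ tokens   # residual mixing across the token axis
    desc = mixed.reshape(-1) @ W_out    # flatten and project to K dims
    return desc / np.linalg.norm(desc)  # normalize for cosine similarity

# Retrieval: compare the query descriptor against database descriptors.
db = np.stack([feature_mix(rng.standard_normal((N, D))) for _ in range(10)])
query = feature_mix(rng.standard_normal((N, D)))
top1 = int(np.argmax(db @ query))       # index of the best-matching place
```

In a real system, `tokens` would be the patch embeddings produced by the ViT backbone for each image, and Top-1 accuracy is the fraction of queries whose best match lies within the ground-truth distance threshold.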
ABSTRACT
Visual place recognition (VPR) requires image descriptors that are robust to differences in camera viewpoint and drastic changes in the external environment. Utilizing multiscale features improves the robustness of image descriptors; however, when enhancing image descriptors, existing methods neither exploit the multiscale features generated during feature extraction nor address the feature-redundancy problem that arises when fusing multiscale information. We propose a novel encoding strategy, convolutional multilayer perceptron orthogonal fusion of multiscale features (ConvMLP-OFMS), for VPR. A ConvMLP is used to obtain robust and generalized global image descriptors, and the multiscale features generated during feature extraction are used to enhance these global descriptors to cope with changes in the environment and viewpoint. Additionally, an attention mechanism is used to eliminate noise and redundant information. In contrast to traditional methods that fuse features by tensor splicing, we introduce matrix orthogonal decomposition to eliminate redundant information. Experiments demonstrated that the proposed architecture outperforms NetVLAD, CosPlace, ConvAP, and other methods. On the Pittsburgh and MSLS datasets, which contain significant viewpoint and illumination variations, our method achieved 92.5% and 86.5% Recall@1, respectively. It also achieved good performance, 80.6% and 43.2%, on the SPED and NordLand datasets, respectively, which have more extreme illumination and appearance variations.
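The orthogonal-fusion idea, removing from a multiscale feature the component already captured by the global descriptor before fusing, can be sketched with a simple vector decomposition. This is an illustrative assumption about the mechanism, not the authors' code: the feature dimensions and the `orthogonal_fuse` helper are hypothetical.

```python
import numpy as np

def orthogonal_fuse(global_desc, local_feat):
    """Fuse a multiscale feature with a global descriptor, keeping only the
    component of the local feature orthogonal to the global direction, so
    information already present in the global descriptor is not duplicated
    (unlike plain tensor splicing/concatenation)."""
    g = global_desc / np.linalg.norm(global_desc)
    proj = (local_feat @ g) * g          # component parallel to global_desc
    ortho = local_feat - proj            # redundant (parallel) part removed
    fused = np.concatenate([global_desc, ortho])
    return fused / np.linalg.norm(fused) # normalized fused descriptor

rng = np.random.default_rng(1)
g = rng.standard_normal(128)             # global descriptor (hypothetical size)
l = rng.standard_normal(128)             # multiscale/local feature
fused = orthogonal_fuse(g, l)
```

By construction, the retained local component `fused[128:]` has zero dot product with the global direction, which is the sense in which the decomposition eliminates redundancy.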