1.
Sensors (Basel); 22(8), 2022 Apr 12.
Article in English | MEDLINE | ID: mdl-35458926

ABSTRACT

Room-temperature phosphorescent (RTP) carbon dots (CDs) have promising applications in bioimaging, anticounterfeiting, and information encryption owing to their long lifetimes and large Stokes shifts. Numerous researchers are interested in developing highly bright RTP CDs using environmentally friendly and safe synthesis processes (e.g., natural raw materials and zero-pollution production pathways). In this study, we successfully synthesized RTP CDs via a hydrothermal process employing natural vitamins as the raw material, ethylenediamine as a passivator, and boric acid as a phosphorescence enhancer; the products are referred to as phosphorescent CDs (PCDs). The PCDs exhibit both bright blue fluorescence emission and green RTP emission, with a phosphorescence lifetime as long as 293 ms and an excellent green afterglow visible to the naked eye for up to 7.0 s. The total quantum yield is 12.69%, and the phosphorescence quantum yield (PQY) reaches 5.15%. Based on this RTP performance, the PCDs have been successfully employed in anticounterfeiting and information-protection applications. The results of this study provide a green strategy for the scalable synthesis of RTP materials and a practical method for fabricating RTP materials with high efficiency and long afterglow lifetimes.


Subjects
Carbon, Radiation, Fluorescence
2.
Tour Manag; 93: 104618, 2022 Dec.
Article in English | MEDLINE | ID: mdl-35782689

ABSTRACT

Taking appropriate strategies in response to the COVID-19 crisis has presented significant challenges to the hospitality industry. Based on situational crisis communication theory (SCCT), this study examines how the strategies adopted by the hotel industry have shaped customers' experience and satisfaction. A mixed-method approach was employed, analysing 6,556 COVID-19-related online reviews. The qualitative findings suggest that 'rebuild strategies' dominated most hotels' response to the COVID-19 crisis, while the quantitative findings confirm the direct impact of affective evaluation and cognitive effort on customer satisfaction. The results further reveal that hotels' crisis response strategies moderate the effects of affective evaluation and cognitive effort on customer satisfaction. The study contributes new knowledge on health-related crisis management and expands the application of SCCT in tourism research.

3.
Community Ment Health J; 55(1): 112-119, 2019 Jan.
Article in English | MEDLINE | ID: mdl-29532304

ABSTRACT

It is well known that health inequality exists between rural and urban Chinese populations; however, the health differences among rural Chinese residents themselves remain unclear. This study aims to assess the physical and mental health of rural Chinese residents in different social classes and to examine the mediating role of hopelessness between social class and health-related quality of life (HRQOL). Stratified multi-stage sampling was used to recruit 2,003 rural residents, who responded to the 12-item Short Form Health Survey (SF-12). The results showed that lower-class rural Chinese residents reported poorer physical and mental health as well as a higher level of hopelessness. Furthermore, hopelessness fully mediated the association between social class and physical and mental health. These findings have significant implications for identifying those at particular risk of lower quality of life and for designing social work intervention programs in the context of rural China.


Subjects
Asian People/psychology, Asian People/statistics & numerical data, Emotions, Quality of Life, Rural Population/statistics & numerical data, Social Class, Adult, Aged, China, Female, Health Status, Health Status Disparities, Humans, Male, Middle Aged, Surveys and Questionnaires
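
For readers who want to see what "full mediation" means operationally, here is a minimal sketch of the regression logic with simulated data; the variable names, effect sizes, and statsmodels workflow are illustrative assumptions, not the study's data or code:

```python
import numpy as np
import statsmodels.api as sm

# Simulate data in which hopelessness fully mediates the class -> HRQOL path.
rng = np.random.default_rng(0)
social_class = rng.normal(size=500)
hopelessness = -0.6 * social_class + rng.normal(size=500)   # path a
hrqol = -0.7 * hopelessness + rng.normal(size=500)          # path b, no direct path c'

total = sm.OLS(hrqol, sm.add_constant(social_class)).fit()
direct = sm.OLS(hrqol, sm.add_constant(
    np.column_stack([social_class, hopelessness]))).fit()

print(total.params[1])    # total effect: clearly positive (runs via the mediator)
print(direct.params[1])   # direct effect: near zero once hopelessness is controlled
print(direct.params[2])   # mediator effect: strongly negative
```
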
4.
Article in English | MEDLINE | ID: mdl-38748521

ABSTRACT

Vision Transformers have recently been the most popular network architectures in visual recognition due to their strong ability to encode global information. However, their high computational cost when processing high-resolution images limits applications in downstream tasks. In this paper, we take a deep look at the internal structure of self-attention and present a simple Transformer-style convolutional neural network (ConvNet) for visual recognition. By comparing the design principles of recent ConvNets and Vision Transformers, we propose to simplify self-attention with a convolutional modulation operation. We show that such a simple approach can better take advantage of the large kernels (≥ 7×7) nested in convolutional layers, and we observe a consistent performance improvement when gradually increasing the kernel size from 5×5 to 21×21. We build a family of hierarchical ConvNets using the proposed convolutional modulation, termed Conv2Former. Our network is simple and easy to follow. Experiments show that Conv2Former outperforms existing popular ConvNets and vision Transformers, such as Swin Transformer and ConvNeXt, on ImageNet classification, COCO object detection, and ADE20K semantic segmentation. Our code is available at https://github.com/HVision-NKU/Conv2Former.
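
As a rough illustration of the convolutional modulation idea described above, here is a minimal PyTorch sketch; the block structure, layer names, and the 11×11 kernel are assumptions based on the abstract, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class ConvMod(nn.Module):
    """Convolutional modulation block (sketch). A large-kernel depthwise
    convolution produces a modulation map that is multiplied element-wise
    with a linearly projected value branch, mimicking attention weights."""
    def __init__(self, dim: int, kernel_size: int = 11):
        super().__init__()
        self.a = nn.Sequential(
            nn.Conv2d(dim, dim, 1),                           # pointwise projection
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size,
                      padding=kernel_size // 2, groups=dim),  # large depthwise kernel
        )
        self.v = nn.Conv2d(dim, dim, 1)                       # value projection
        self.proj = nn.Conv2d(dim, dim, 1)                    # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Hadamard product replaces the softmax attention matrix.
        return self.proj(self.a(x) * self.v(x))

x = torch.randn(2, 64, 56, 56)
print(ConvMod(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```
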

5.
IEEE Trans Pattern Anal Mach Intell; 46(4): 2316-2332, 2024 Apr.
Article in English | MEDLINE | ID: mdl-37934644

ABSTRACT

Tuberculosis (TB) is a major global health threat, causing millions of deaths annually. Although early diagnosis and treatment can greatly improve the chances of survival, they remain a major challenge, especially in developing countries. Recently, computer-aided tuberculosis diagnosis (CTD) using deep learning has shown promise, but progress is hindered by limited training data. To address this, we establish a large-scale dataset, namely the Tuberculosis X-ray (TBX11K) dataset, which contains 11,200 chest X-ray (CXR) images with corresponding bounding-box annotations for TB areas. This dataset enables the training of sophisticated detectors for high-quality CTD. Furthermore, we propose a strong baseline, SymFormer, for simultaneous CXR image classification and TB infection area detection. SymFormer incorporates Symmetric Search Attention (SymAttention) to exploit the bilateral symmetry property of CXR images for learning discriminative features. Since CXR images may not strictly adhere to the bilateral symmetry property, we also propose Symmetric Positional Encoding (SPE) to facilitate SymAttention through feature recalibration. To promote future research on CTD, we build a benchmark by introducing evaluation metrics, evaluating baseline models reformed from existing detectors, and running an online challenge. Experiments show that SymFormer achieves state-of-the-art performance on the TBX11K dataset.


Subjects
Algorithms, Tuberculosis, Humans, Tuberculosis/diagnostic imaging, Computers
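
As a hypothetical illustration of how bilateral symmetry might be exploited in attention, the following sketch lets each query attend over both the feature map and its horizontal mirror; this is a simplification assumed from the abstract, not SymFormer's actual SymAttention:

```python
import torch
import torch.nn as nn

class SymmetricAttention(nn.Module):
    """Sketch of symmetry-aware attention for chest X-rays: each query
    attends to keys/values from both the feature map and its left-right
    mirror, so evidence on one lung can be contrasted with the
    anatomically symmetric region."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from a CXR backbone.
        b, c, h, w = x.shape
        mirrored = torch.flip(x, dims=[-1])       # bilateral (left-right) mirror
        q = x.flatten(2).transpose(1, 2)          # (B, H*W, C) queries
        kv = torch.cat([x, mirrored], dim=-1)     # original + mirrored columns
        kv = kv.flatten(2).transpose(1, 2)        # (B, 2*H*W, C) keys/values
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(1, 64, 16, 16)
print(SymmetricAttention(64)(x).shape)  # torch.Size([1, 64, 16, 16])
```
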
6.
IEEE Trans Pattern Anal Mach Intell; 46(4): 2506-2517, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38015699

ABSTRACT

Masked image modeling (MIM) has achieved promising results on various vision tasks. However, the limited discriminability of the learned representations shows there is still plenty of room for building a stronger vision learner. Towards this goal, we propose Contrastive Masked Autoencoders (CMAE), a new self-supervised pre-training method for learning more comprehensive and capable vision representations. By carefully unifying contrastive learning (CL) and masked image modeling (MIM) through novel designs, CMAE leverages their respective advantages and learns representations with both strong instance discriminability and local perceptibility. Specifically, CMAE consists of two branches: the online branch is an asymmetric encoder-decoder, and the momentum branch is a momentum-updated encoder. During training, the online encoder reconstructs original images from latent representations of masked images to learn holistic features. The momentum encoder, fed with the full images, enhances feature discriminability via contrastive learning with its online counterpart. To make CL compatible with MIM, CMAE introduces two new components: pixel shifting for generating plausible positive views and a feature decoder for complementing the features of contrastive pairs. Thanks to these novel designs, CMAE effectively improves representation quality and transfer performance over its MIM counterpart. CMAE achieves state-of-the-art performance on highly competitive benchmarks for image classification, semantic segmentation, and object detection. Notably, CMAE-Base achieves 85.3% top-1 accuracy on ImageNet and 52.5% mIoU on ADE20K, surpassing previous best results by 0.7% and 1.8%, respectively.
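
The "pixel shifting" component lends itself to a short sketch: two crops of one image taken at slightly offset origins serve as plausible positive views. The crop size and shift range below are assumptions, not CMAE's exact recipe:

```python
import torch

def pixel_shifted_views(img: torch.Tensor, max_shift: int = 31, crop: int = 224):
    """Sketch of pixel shifting: two crops of the same image at slightly
    shifted origins give plausible positive pairs for contrastive learning
    while staying spatially aligned enough for masked reconstruction."""
    _, h, w = img.shape
    # Origin of the first (reconstruction) view.
    y0 = torch.randint(0, h - crop - max_shift, (1,)).item()
    x0 = torch.randint(0, w - crop - max_shift, (1,)).item()
    # Second (contrastive) view: same crop size, origin shifted by a few pixels.
    dy = torch.randint(0, max_shift + 1, (1,)).item()
    dx = torch.randint(0, max_shift + 1, (1,)).item()
    view_online = img[:, y0:y0 + crop, x0:x0 + crop]
    view_momentum = img[:, y0 + dy:y0 + dy + crop, x0 + dx:x0 + dx + crop]
    return view_online, view_momentum

img = torch.randn(3, 512, 512)
v1, v2 = pixel_shifted_views(img)
print(v1.shape, v2.shape)  # torch.Size([3, 224, 224]) twice
```
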

7.
Article in English | MEDLINE | ID: mdl-38949946

ABSTRACT

Previous knowledge distillation (KD) methods mostly focus on compressing network architectures, which is insufficient for deployment, since costs such as transmission bandwidth and imaging equipment also depend on the image size. We therefore propose Pixel Distillation, which extends knowledge distillation to the input level while simultaneously breaking architecture constraints. Such a scheme achieves flexible cost control for deployment, as it allows the system to adjust both the network architecture and the image quality according to the overall resource requirements. Specifically, we first propose an input spatial representation distillation (ISRD) mechanism to transfer spatial knowledge from large images to the student's input module, which facilitates stable knowledge transfer between CNNs and ViTs. A Teacher-Assistant-Student (TAS) framework is then established to disentangle pixel distillation into a model compression stage and an input compression stage, which significantly reduces the overall complexity of pixel distillation and the difficulty of distilling intermediate knowledge. Finally, we adapt pixel distillation to object detection via an aligned feature for preservation (AFP) strategy for TAS, which aligns the output dimensions of the detectors at each stage by manipulating the features and anchors of the assistant. Comprehensive experiments on image classification and object detection demonstrate the effectiveness of our method.
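
A minimal sketch of what an input-level distillation loss could look like, assuming the student's low-resolution features are resized and projected to match the teacher's; the function name, projection, and loss choice are illustrative, not the paper's exact ISRD:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def isrd_loss(teacher_feat: torch.Tensor, student_feat: torch.Tensor,
              proj: nn.Module) -> torch.Tensor:
    """Sketch of an input-level spatial distillation loss: the student, fed a
    low-resolution image, is trained so that its early ('input module')
    features match the teacher's features from the full-resolution image,
    after resizing and channel projection."""
    student_up = F.interpolate(student_feat, size=teacher_feat.shape[-2:],
                               mode='bilinear', align_corners=False)
    return F.mse_loss(proj(student_up), teacher_feat)

teacher_feat = torch.randn(2, 96, 56, 56)   # e.g., from a 224x224 input
student_feat = torch.randn(2, 48, 28, 28)   # e.g., from a 112x112 input
proj = nn.Conv2d(48, 96, kernel_size=1)     # learned channel projection
print(isrd_loss(teacher_feat, student_feat, proj))
```
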

8.
Adv Sci (Weinh); e2404230, 2024 Jul 10.
Article in English | MEDLINE | ID: mdl-38984451

ABSTRACT

Glioblastoma multiforme (GBM) is the most common primary malignant brain tumor and is known for its challenging prognosis. Sonodynamic therapy (SDT) is an innovative therapeutic approach that shows promise for tumor elimination by activating sonosensitizers with low-intensity ultrasound. In this study, a novel sonosensitizer is synthesized using Cu-doped carbon dots (Cu-CDs) for the sonodynamic treatment of GBM. Doping with copper transforms the carbon dots into a p-n type semiconductor with a bandgap of 1.58 eV, a prolonged carrier lifetime of 10.7 µs, and improved electron-hole separation efficiency, thereby enhancing the sonodynamic effect. Western blot analysis reveals that the Cu-CDs induce a biological response leading to cell death, termed cuproptosis. Specifically, Cu-CDs upregulate dihydrolipoamide S-acetyltransferase expression, thereby establishing a synergistic therapeutic effect against tumor cells when combined with SDT. Furthermore, Cu-CDs exhibit excellent permeability through the blood-brain barrier and potent anti-tumor activity. Importantly, the Cu-CDs effectively impede the growth of glioblastoma tumors and prolong the survival of tumor-bearing mice. This study supports the application of carbon-based nanomaterials as sonosensitizers in tumor therapy.

9.
IEEE Trans Image Process; 33: 2058-2073, 2024.
Article in English | MEDLINE | ID: mdl-38470576

ABSTRACT

Existing Cross-Domain Few-Shot Learning (CDFSL) methods require access to source-domain data to train a model in the pre-training phase. However, due to increasing concerns about data privacy and the desire to reduce data transmission and training costs, it is necessary to develop a CDFSL solution that does not access source data. This paper therefore explores the Source-Free CDFSL (SF-CDFSL) problem, in which CDFSL is addressed through existing pretrained models instead of training a model with source data. The lack of source data raises two key challenges: effectively tackling CDFSL with only limited labeled target samples, and the impossibility of addressing domain disparities by aligning the source and target distributions. This paper proposes an enhanced Information Maximization with Distance-Aware Contrastive Learning (IM-DCL) method to address these challenges. First, we introduce a transductive mechanism for learning the query set. Second, information maximization (IM) is explored to map target samples to predictions with both individual certainty and global diversity, helping the source model better fit the target data distribution. However, IM fails to learn the decision boundary of the target task. This motivates a novel approach called Distance-Aware Contrastive Learning (DCL), in which we consider the entire feature set as both a positive and a negative set, akin to Schrödinger's concept of a dual state. Instead of a rigid separation between positive and negative sets, we employ a weighted distance calculation among features to establish a soft classification of the positive and negative sets over the entire feature set. We explore three types of negative weights to enhance CDFSL performance. Furthermore, we address issues related to IM by incorporating contrastive constraints between object features and their corresponding positive and negative sets. Evaluations on the four datasets of the BSCD-FSL benchmark indicate that the proposed IM-DCL, without accessing the source domain, demonstrates superiority over existing methods, especially on the distant-domain tasks. Additionally, the ablation study and performance analysis confirm the ability of IM-DCL to handle SF-CDFSL. The code will be made public at https://github.com/xuhuali-mxj/IM-DCL.
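
The information-maximization term can be sketched with the standard two-entropy formulation; the exact IM-DCL objective and its DCL term may differ:

```python
import torch
import torch.nn.functional as F

def information_maximization_loss(logits: torch.Tensor) -> torch.Tensor:
    """Sketch of a standard information-maximization objective: encourage
    confident predictions per sample (low conditional entropy) while keeping
    predictions diverse across the batch (high marginal entropy)."""
    probs = F.softmax(logits, dim=1)
    log_probs = F.log_softmax(logits, dim=1)
    # Individual certainty: mean per-sample entropy, to be minimized.
    cond_entropy = -(probs * log_probs).sum(dim=1).mean()
    # Global diversity: entropy of the mean prediction, to be maximized.
    mean_probs = probs.mean(dim=0)
    marg_entropy = -(mean_probs * torch.log(mean_probs + 1e-8)).sum()
    return cond_entropy - marg_entropy

logits = torch.randn(32, 5)  # 5-way few-shot query predictions
print(information_maximization_loss(logits))
```
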

10.
Article in English | MEDLINE | ID: mdl-38833400

ABSTRACT

A fundamental limitation of object detectors is that they suffer from "spatial bias": in particular, they perform less satisfactorily when detecting objects near image borders. For a long time, there has been a lack of effective ways to measure and identify spatial bias, and little is known about where it comes from and to what degree it occurs. To this end, we present a new zone evaluation protocol, extending the traditional evaluation to a more generalized one that measures detection performance over zones, yielding a series of Zone Precisions (ZPs). For the first time, we provide numerical results showing that object detectors perform quite unevenly across zones. Surprisingly, the detector's performance in the 96% border zone of the image does not reach the AP value (Average Precision, commonly regarded as the average detection performance over the entire image zone). To better understand spatial bias, we conduct a series of heuristic experiments. Our investigation rules out two intuitive conjectures about spatial bias, showing that object scale and the absolute positions of objects barely influence it. We find that the key lies in the human-imperceptible divergence in data patterns between objects in different zones, which eventually forms a visible performance gap between the zones. With these findings, we finally discuss a future direction for object detection, namely the spatial disequilibrium problem, which aims at pursuing a balanced detection ability over the entire image zone. By broadly evaluating 10 popular object detectors on 5 detection datasets, we shed light on the spatial bias of object detectors. We hope this work can raise a focus on detection robustness. The source code, evaluation protocols, and tutorials are publicly available at https://github.com/Zzh-tju/ZoneEval.
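
A minimal sketch of how detections could be binned into border-to-center zones for computing per-zone precision; the ring layout and five-zone split are assumptions, not the paper's exact protocol:

```python
def zone_of(box, img_w, img_h, n_rings=5):
    """Sketch of assigning a detection to an evaluation zone: zones are
    concentric rings between the image border and center, indexed 0
    (outermost) to n_rings-1 (central). Zone Precision is then ordinary
    AP computed per ring."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    # Normalized distance to the nearest border: 0 at the border, 0.5 at center.
    d = min(cx / img_w, cy / img_h, 1 - cx / img_w, 1 - cy / img_h)
    return min(int(d / (0.5 / n_rings)), n_rings - 1)

print(zone_of([5, 40, 60, 120], 640, 480))      # near the left border -> ring 0
print(zone_of([300, 220, 340, 260], 640, 480))  # central object -> ring 4
```
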

11.
IEEE Trans Pattern Anal Mach Intell; 45(12): 15619-15631, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37647184

ABSTRACT

Learning representations with self-supervision for convolutional networks (CNNs) has been validated as effective for vision tasks. As an alternative to CNNs, vision transformers (ViTs) have strong representation ability with spatial self-attention and channel-level feedforward networks. Recent works reveal that self-supervised learning helps unleash the great potential of ViTs. Still, most works follow self-supervised strategies designed for CNNs, e.g., instance-level discrimination of samples, and ignore the properties of ViTs. We observe that relational modeling on the spatial and channel dimensions distinguishes ViTs from other networks. To enforce this property, we explore feature SElf-RElation (SERE) for training self-supervised ViTs. Specifically, instead of conducting self-supervised learning solely on feature embeddings from multiple views, we utilize feature self-relations, i.e., spatial/channel self-relations, for self-supervised learning. Self-relation based learning further enhances the relation modeling ability of ViTs, resulting in stronger representations that stably improve performance on multiple downstream tasks.
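
A rough sketch of what aligning feature self-relations between two views might look like; the softmax normalization, KL alignment loss, and symmetric treatment of views are assumptions, not SERE's exact formulation:

```python
import torch
import torch.nn.functional as F

def self_relations(feat: torch.Tensor):
    """Spatial and channel self-relations (sketch): self-supervision aligns
    these relation matrices between two augmented views of the same image,
    instead of the raw embeddings."""
    b, n, c = feat.shape                                                  # (batch, tokens, channels)
    spatial = F.softmax(feat @ feat.transpose(1, 2) / c ** 0.5, dim=-1)   # (B, N, N)
    channel = F.softmax(feat.transpose(1, 2) @ feat / n ** 0.5, dim=-1)   # (B, C, C)
    return spatial, channel

def relation_alignment_loss(feat_a, feat_b):
    # Align self-relations of two views; KL divergence is one natural choice.
    sa, ca = self_relations(feat_a)
    sb, cb = self_relations(feat_b)
    return F.kl_div(sa.log(), sb, reduction='batchmean') + \
           F.kl_div(ca.log(), cb, reduction='batchmean')

feat_a, feat_b = torch.randn(2, 196, 384), torch.randn(2, 196, 384)
print(relation_alignment_loss(feat_a, feat_b))
```
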

12.
IEEE Trans Pattern Anal Mach Intell; 45(4): 4214-4228, 2023 Apr.
Article in English | MEDLINE | ID: mdl-35994547

ABSTRACT

Open set recognition enables deep neural networks (DNNs) to identify samples of unknown classes while maintaining high classification accuracy on samples of known classes. Existing methods based on auto-encoders (AEs) and prototype learning show great potential in handling this challenging task. In this study, we propose a novel method, called Class-Specific Semantic Reconstruction (CSSR), that integrates the power of AEs and prototype learning. Specifically, CSSR replaces prototype points with manifolds represented by class-specific AEs. Unlike conventional prototype-based methods, CSSR models each known class on an individual AE manifold and measures class belongingness through the AE's reconstruction error. Class-specific AEs are plugged into the top of the DNN backbone and reconstruct the semantic representations learned by the DNN instead of the raw image. Through end-to-end learning, the DNN and the AEs boost each other to learn both discriminative and representative information. The results of experiments conducted on multiple datasets show that the proposed method achieves outstanding performance in both closed- and open-set recognition and is sufficiently simple and flexible to be incorporated into existing frameworks.
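
A minimal sketch of the class-specific reconstruction idea, assuming one small AE per known class on top of backbone features and negative reconstruction error as the class score; architecture sizes and the rejection threshold are illustrative:

```python
import torch
import torch.nn as nn

class ClassSpecificAEs(nn.Module):
    """Sketch of class-specific semantic reconstruction: one small
    auto-encoder per known class sits on top of the backbone's semantic
    feature; class belongingness is the negative reconstruction error, and
    a large minimum error flags an unknown class."""
    def __init__(self, feat_dim: int, num_classes: int, bottleneck: int = 32):
        super().__init__()
        self.aes = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, bottleneck), nn.ReLU(),
                          nn.Linear(bottleneck, feat_dim))
            for _ in range(num_classes))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, D) semantic features from the DNN backbone.
        errors = torch.stack(
            [((ae(feat) - feat) ** 2).mean(dim=1) for ae in self.aes], dim=1)
        return -errors   # (B, K) scores; the max score gives the predicted class

model = ClassSpecificAEs(feat_dim=512, num_classes=10)
scores = model(torch.randn(4, 512))
is_unknown = scores.max(dim=1).values < -0.9   # threshold is an assumption
print(scores.shape, is_unknown)
```
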

13.
IEEE Trans Pattern Anal Mach Intell; 45(2): 2344-2366, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35404809

ABSTRACT

In this paper, we identify and address a serious design bias of existing salient object detection (SOD) datasets, which unrealistically assume that each image should contain at least one clear and uncluttered salient object. This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets. However, these models are still far from satisfactory when applied to real-world scenes. Based on our analyses, we propose a new high-quality dataset and update the previous saliency benchmark. Specifically, our dataset, called Salient Objects in Clutter (SOC), includes images with both salient and non-salient objects from several common object categories. In addition to object-category annotations, each salient image is accompanied by attributes that reflect common challenges in real scenes, which can help provide deeper insight into the SOD problem. Further, given a saliency encoder, e.g., the backbone network, existing saliency models are designed to learn a mapping from the training image set to the training ground-truth set. We therefore argue that improving the dataset can yield higher performance gains than focusing only on the decoder design. With this in mind, we investigate several dataset-enhancement strategies, including label smoothing to implicitly emphasize salient boundaries, random image augmentation to adapt saliency models to various scenarios, and self-supervised learning as a regularization strategy for learning from small datasets. Our extensive results demonstrate the effectiveness of these tricks. We also provide a comprehensive benchmark for SOD, which can be found in our repository: https://github.com/DengPingFan/SODBenchmark.
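
Of the dataset-enhancement strategies above, the label-smoothing trick admits a short sketch; blurring the binary mask is one plausible way to soften labels near salient boundaries (the exact SOC recipe is assumed, not quoted):

```python
import torch
import torch.nn.functional as F

def boundary_label_smoothing(mask: torch.Tensor, kernel: int = 5) -> torch.Tensor:
    """Sketch of boundary-emphasizing label smoothing: blurring the binary
    saliency ground truth turns hard 0/1 labels into soft targets, with
    intermediate values concentrated along object boundaries."""
    # mask: (B, 1, H, W) binary ground truth in {0, 1}.
    weight = torch.ones(1, 1, kernel, kernel) / (kernel * kernel)
    return F.conv2d(mask.float(), weight, padding=kernel // 2)

mask = torch.zeros(1, 1, 8, 8)
mask[..., 2:6, 2:6] = 1.0
soft = boundary_label_smoothing(mask)
print(soft[0, 0])   # values in (0, 1) ring the object boundary
```
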

14.
IEEE Trans Pattern Anal Mach Intell; 45(3): 2984-3002, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35714090

ABSTRACT

Temporal/spatial receptive fields of models play an important role in sequential/spatial tasks. Large receptive fields facilitate long-term relations, while small receptive fields help capture local details. Existing methods construct models with hand-designed receptive fields in their layers. Can we effectively search for receptive field combinations to replace hand-designed patterns? To answer this question, we propose finding better receptive field combinations through a global-to-local search scheme. Our scheme exploits both a global search, to find coarse combinations, and a local search, to further refine them. The global search finds possible coarse combinations other than human-designed patterns. On top of the global search, we propose an expectation-guided iterative local search scheme to refine the combinations effectively. Our RF-Next models, which plug receptive field search into various architectures, boost performance on many tasks, e.g., temporal action segmentation, object detection, instance segmentation, and speech synthesis. The source code is publicly available at http://mmcheng.net/rfnext.
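
A generic sketch of a global-to-local search loop, assuming a coarse random global stage followed by coordinate-wise local refinement; RF-Next's actual expectation-guided scheme is more elaborate:

```python
import random

def global_to_local_search(evaluate, candidates, rounds=3):
    """Generic global-to-local search over receptive-field (dilation)
    combinations. 'evaluate' scores one combination, e.g. by a short proxy
    training run (assumed)."""
    # Global stage: coarsely sample combinations from the full search space.
    pool = random.sample(candidates, k=min(16, len(candidates)))
    best = max(pool, key=evaluate)
    # Local stage: hill-climb around the current best combination.
    for _ in range(rounds):
        neighbors = []
        for i in range(len(best)):
            for delta in (-1, 1):
                cand = list(best)
                cand[i] = max(1, cand[i] + delta)
                neighbors.append(tuple(cand))
        best = max([best, *neighbors], key=evaluate)
    return best

# Toy usage: search dilation rates for 3 layers; the score peaks at (2, 4, 8).
candidates = [(a, b, c) for a in range(1, 9) for b in range(1, 9) for c in range(1, 9)]
score = lambda comb: -sum((x - t) ** 2 for x, t in zip(comb, (2, 4, 8)))
print(global_to_local_search(score, candidates))
```
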

15.
IEEE Trans Pattern Anal Mach Intell; 45(6): 6647-6658, 2023 Jun.
Article in English | MEDLINE | ID: mdl-32886607

ABSTRACT

With the success of deep learning in classifying short trimmed videos, more attention has been focused on temporally segmenting and classifying activities in long untrimmed videos. State-of-the-art approaches for action segmentation utilize several layers of temporal convolution and temporal pooling. Despite the capabilities of these approaches in capturing temporal dependencies, their predictions suffer from over-segmentation errors. In this paper, we propose a multi-stage architecture for the temporal action segmentation task that overcomes the limitations of the previous approaches. The first stage generates an initial prediction that is refined by the next ones. In each stage we stack several layers of dilated temporal convolutions covering a large receptive field with few parameters. While this architecture already performs well, lower layers still suffer from a small receptive field. To address this limitation, we propose a dual dilated layer that combines both large and small receptive fields. We further decouple the design of the first stage from the refining stages to address the different requirements of these stages. Extensive evaluation shows the effectiveness of the proposed model in capturing long-range dependencies and recognizing action segments. Our models achieve state-of-the-art results on three datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
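
The dual dilated layer admits a compact sketch: each layer runs one small-dilation and one large-dilation temporal convolution and fuses them, so no layer is stuck with a tiny receptive field. The concatenation-plus-1×1 fusion and dilation pairing below are assumptions:

```python
import torch
import torch.nn as nn

class DualDilatedLayer(nn.Module):
    """Sketch of a dual dilated temporal layer: one branch uses a small
    dilation, the other a large one, so every layer mixes local detail with
    long-range context."""
    def __init__(self, dim: int, dil_small: int, dil_large: int):
        super().__init__()
        self.conv_s = nn.Conv1d(dim, dim, 3, padding=dil_small, dilation=dil_small)
        self.conv_l = nn.Conv1d(dim, dim, 3, padding=dil_large, dilation=dil_large)
        self.fuse = nn.Conv1d(2 * dim, dim, 1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.relu(self.fuse(torch.cat([self.conv_s(x), self.conv_l(x)], dim=1)))
        return x + y   # residual connection keeps per-frame detail

# Layer i of L might pair dilation 2**i with 2**(L-1-i), e.g. i=0, L=10:
layer = DualDilatedLayer(64, dil_small=1, dil_large=512)
print(layer(torch.randn(2, 64, 1000)).shape)  # torch.Size([2, 64, 1000])
```
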

16.
IEEE Trans Pattern Anal Mach Intell; 45(1): 887-904, 2023 Jan.
Article in English | MEDLINE | ID: mdl-34982676

ABSTRACT

We explore the potential of pooling techniques on the task of salient object detection by expanding their role in convolutional neural networks. In general, two pooling-based modules are proposed. A global guidance module (GGM) is first built based on the bottom-up pathway of the U-shape architecture, which aims to guide the location information of the potential salient objects into layers at different feature levels. A feature aggregation module (FAM) is further designed to seamlessly fuse the coarse-level semantic information with the fine-level features in the top-down pathway. We can progressively refine the high-level semantic features with these two modules and obtain detail-enriched saliency maps. Experimental results show that our proposed approach can locate salient objects more accurately with sharpened details and substantially improves performance compared with existing state-of-the-art methods. Moreover, our approach is fast, running at 53 FPS when processing a 300×400 image. To make our approach better suited to mobile applications, we take MobileNetV2 as our backbone and re-tailor the structure of our pooling-based modules. Our mobile model achieves a running speed of 66 FPS yet still performs better than most existing state-of-the-art methods. To verify the generalization ability of the proposed method, we apply it to the edge detection, RGB-D salient object detection, and camouflaged object detection tasks, and our method achieves better results than the corresponding state-of-the-art methods on these three tasks. Code can be found at http://mmcheng.net/poolnet/.
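
A rough sketch of a pooling-based aggregation module in the spirit of FAM, assuming multi-rate average pooling with upsampled summation; the exact PoolNet design differs in its details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAggregationModule(nn.Module):
    """Sketch of a pooling-based aggregation module: the fused feature is
    average-pooled at several downsampling rates, each branch is convolved,
    upsampled back, and summed, smoothing the coarse semantics merged into
    fine-level features."""
    def __init__(self, dim: int, rates=(2, 4, 8)):
        super().__init__()
        self.rates = rates
        self.convs = nn.ModuleList(
            nn.Conv2d(dim, dim, 3, padding=1) for _ in rates)
        self.out = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        size = x.shape[-2:]
        y = x
        for rate, conv in zip(self.rates, self.convs):
            pooled = F.avg_pool2d(x, kernel_size=rate, stride=rate)
            y = y + F.interpolate(conv(pooled), size=size, mode='bilinear',
                                  align_corners=False)
        return self.out(y)

print(FeatureAggregationModule(128)(torch.randn(1, 128, 40, 40)).shape)
```
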

17.
IEEE Trans Pattern Anal Mach Intell; 45(11): 12760-12771, 2023 Nov.
Article in English | MEDLINE | ID: mdl-36040936

ABSTRACT

Recently, the vision transformer has achieved great success by pushing the state of the art on various vision tasks. One of the most challenging problems with vision transformers is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, in which the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose adapting pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Equipped with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T.
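
The pooling-based MHSA idea can be sketched as follows: queries come from the full token sequence, while keys and values come from a concatenated pyramid of pooled maps, shortening the attended sequence. The pooling ratios and plain MultiheadAttention below are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingAttention(nn.Module):
    """Sketch of pooling-based MHSA: keys and values come from the feature
    map average-pooled at several ratios and concatenated, so the attended
    token sequence is much shorter than H*W while still carrying
    multi-scale context."""
    def __init__(self, dim: int, heads: int = 8, ratios=(2, 4, 8)):
        super().__init__()
        self.ratios = ratios
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)                 # (B, H*W, C) queries
        pooled = [F.adaptive_avg_pool2d(x, (max(h // r, 1), max(w // r, 1)))
                  for r in self.ratios]                  # pyramid of pooled maps
        kv = torch.cat([p.flatten(2) for p in pooled], dim=2).transpose(1, 2)
        out, _ = self.attn(q, kv, kv)                    # short key/value sequence
        return out.transpose(1, 2).reshape(b, c, h, w)

print(PyramidPoolingAttention(64)(torch.randn(1, 64, 28, 28)).shape)
```
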

18.
Article in English | MEDLINE | ID: mdl-37030762

ABSTRACT

Caustics are challenging light-transport effects for photo-realistic rendering. Photon mapping techniques play a fundamental role in rendering caustics. However, photon mapping methods render a single caustic under a stationary light source in a fixed scene view, and they require significant storage and computing resources to produce high-quality results. In this paper, we propose efficiently rendering more diverse caustics of a scene while the camera and the light source move. We present a novel learning-based volume rendering approach with implicit representations for this task. Considering the variety of materials and textures of planar caustic receivers, we decompose the output appearance into two components, the diffuse and specular parts, with a probabilistic module. Unlike NeRF, we construct the weights for rendering each component from an implicit signed distance function (SDF). Moreover, we introduce centering calibration and the sine activation function to improve the performance of the color prediction network. Extensive experiments on synthetic and real-world datasets illustrate that our method achieves much better performance than the baselines in quantitative and qualitative comparisons for rendering caustics in novel views with a dynamic light source. In particular, our method outperforms the baseline on temporal consistency across frames. Code will be available at https://github.com/JiaxiongQ/NeRC.
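
The sine activation mentioned above is commonly implemented as a SIREN-style layer; the following sketch assumes the usual frequency scale and initialization rather than the paper's exact settings:

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Sketch of a sine-activated linear layer (SIREN-style): the periodic
    activation helps an implicit network capture high-frequency detail such
    as caustics. w0 and the init bounds follow common SIREN practice."""
    def __init__(self, in_dim: int, out_dim: int, w0: float = 30.0, first=False):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.w0 = w0
        bound = 1.0 / in_dim if first else (6.0 / in_dim) ** 0.5 / w0
        nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sin(self.w0 * self.linear(x))

# Tiny color-prediction head: 3D position in, RGB out.
net = nn.Sequential(SineLayer(3, 256, first=True), SineLayer(256, 256),
                    nn.Linear(256, 3))
print(net(torch.randn(8, 3)).shape)  # torch.Size([8, 3])
```
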

19.
IEEE Trans Pattern Anal Mach Intell; 45(7): 8193-8205, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37018612

ABSTRACT

Co-salient object detection (Co-SOD) aims at discovering the common objects in a group of relevant images. Mining a co-representation is essential for locating co-salient objects. Unfortunately, current Co-SOD methods pay insufficient attention to the fact that information unrelated to the co-salient objects can be included in the co-representation; such irrelevant information interferes with locating co-salient objects. In this paper, we propose a Co-Representation Purification (CoRP) method that searches for a noise-free co-representation. We search for a few pixel-wise embeddings that probably belong to co-salient regions; these embeddings constitute our co-representation and guide our prediction. To obtain a purer co-representation, we use the prediction to iteratively remove irrelevant embeddings from the co-representation. Experiments on three datasets demonstrate that CoRP achieves state-of-the-art performance on the benchmark datasets. Our source code is available at https://github.com/ZZY816/CoRP.
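
A simplified sketch of iterative co-representation purification as the abstract describes it: select confidently salient pixel embeddings, average them into a prototype, re-predict by correlation, and iterate. The value of k, the iteration count, and the cosine-similarity prediction are assumptions:

```python
import torch
import torch.nn.functional as F

def purify_corepresentation(feats, pred, k=32, iters=3):
    """Sketch of iterative co-representation purification: pick the k pixel
    embeddings most confidently salient, average them into a
    co-representation, re-predict saliency by correlation, and repeat so
    irrelevant embeddings drop out."""
    b, c, h, w = feats.shape
    flat = feats.permute(0, 2, 3, 1).reshape(b, h * w, c)   # (B, N, C)
    for _ in range(iters):
        conf = pred.reshape(b, h * w)                       # saliency per pixel
        idx = conf.topk(k, dim=1).indices                   # most salient pixels
        chosen = torch.gather(flat, 1, idx.unsqueeze(-1).expand(-1, -1, c))
        proto = F.normalize(chosen.mean(dim=1), dim=1)      # co-representation
        sim = (F.normalize(flat, dim=2) @ proto.unsqueeze(2)).squeeze(2)
        pred = torch.sigmoid(sim).reshape(b, 1, h, w)       # refined prediction
    return pred

feats = torch.randn(2, 64, 20, 20)
pred = torch.rand(2, 1, 20, 20)
print(purify_corepresentation(feats, pred).shape)  # torch.Size([2, 1, 20, 20])
```
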

20.
IEEE Trans Pattern Anal Mach Intell; 45(1): 1328-1334, 2023 Jan.
Article in English | MEDLINE | ID: mdl-35077359

ABSTRACT

In this paper, we present Vision Permutator, a conceptually simple and data-efficient MLP-like architecture for visual recognition. Recognizing the importance of the positional information carried by 2D feature representations, Vision Permutator separately encodes the feature representations along the height and width dimensions with linear projections, unlike recent MLP-like models that encode spatial information along the flattened spatial dimensions. This allows Vision Permutator to capture long-range dependencies while avoiding the attention-building process of transformers. The outputs are then aggregated in a mutually complementing manner to form expressive representations. We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers. Without relying on spatial convolutions or attention mechanisms, Vision Permutator achieves 81.5% top-1 accuracy on ImageNet without extra large-scale training data (e.g., ImageNet-22k) using only 25M learnable parameters, which is much better than most CNNs and vision transformers under the same model-size constraint. When scaled up to 88M parameters, it attains 83.2% top-1 accuracy, greatly improving on recent state-of-the-art MLP-like networks for visual recognition. We hope this work encourages research on rethinking the way spatial information is encoded and facilitates the development of MLP-like models. PyTorch/MindSpore/Jittor code is available at https://github.com/Andrew-Qibin/VisionPermutator.
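
The separate height/width encoding can be sketched with per-dimension linear layers; note that the real Permute-MLP mixes height and width with channel segments and fuses branches with learned reweighting, so this is a simplified reading:

```python
import torch
import torch.nn as nn

class PermuteMLP(nn.Module):
    """Sketch of the core Vision Permutator operation (simplified): separate
    linear projections encode features along the height, width, and channel
    dimensions, and the three branches are summed before a final
    projection."""
    def __init__(self, h: int, w: int, c: int):
        super().__init__()
        self.mlp_h = nn.Linear(h, h)   # mixes along height per (w, c) slice
        self.mlp_w = nn.Linear(w, w)   # mixes along width per (h, c) slice
        self.mlp_c = nn.Linear(c, c)   # ordinary channel projection
        self.proj = nn.Linear(c, c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C)
        x_h = self.mlp_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # along H
        x_w = self.mlp_w(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)  # along W
        x_c = self.mlp_c(x)                                          # along C
        return self.proj(x_h + x_w + x_c)

print(PermuteMLP(14, 14, 384)(torch.randn(2, 14, 14, 384)).shape)
```
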
