ABSTRACT
The morphological analysis and volume measurement of the hippocampus are crucial to the study of many brain diseases, so an accurate hippocampal segmentation method benefits clinical research on brain diseases. U-Net and its variants have become prevalent in hippocampus segmentation from Magnetic Resonance Imaging (MRI) due to their effectiveness, and Transformer-based architectures have also received attention. However, some existing methods focus too much on the shape and volume of the hippocampus rather than its spatial information, and they extract local and global features independently, ignoring the correlation between them. In addition, many methods cannot be applied effectively in practical medical image segmentation because of their large parameter counts and high computational complexity. To this end, we combine the advantages of CNNs and ViTs (Vision Transformers) and propose a simple and lightweight model, Light3DHS, for 3D hippocampus segmentation. To obtain richer local contextual features, the encoder first applies a multi-scale convolutional attention module (MCA) to learn the spatial information of the hippocampus. Considering the importance of local features and global semantics for 3D segmentation, we then use a lightweight ViT to learn scale-invariant high-level features and further fuse local and global representations. To evaluate the effectiveness of the encoder's feature representation, we design three decoders of different complexity to generate segmentation maps. Experiments on three common hippocampal datasets demonstrate that the network achieves more accurate hippocampus segmentation with fewer parameters, and Light3DHS outperforms other state-of-the-art algorithms.
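A minimal PyTorch sketch of what a multi-scale convolutional attention block for 3D volumes could look like, assuming "MCA" denotes parallel depthwise 3D convolutions at several kernel sizes whose fused response gates the input features; the kernel sizes, depthwise design, and sigmoid gate are assumptions rather than the published Light3DHS module.

```python
# Hypothetical 3D multi-scale convolutional attention block (not the authors' exact MCA).
import torch
import torch.nn as nn

class MultiScaleConvAttention3D(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One depthwise 3D convolution per scale keeps the block lightweight.
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.fuse = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum the multi-scale responses, then use them as a spatial attention gate.
        attn = sum(branch(x) for branch in self.branches)
        attn = torch.sigmoid(self.fuse(attn))
        return x * attn

# Example: a batch of 2 feature volumes with 16 channels and 32^3 voxels.
feats = torch.randn(2, 16, 32, 32, 32)
out = MultiScaleConvAttention3D(16)(feats)
print(out.shape)  # torch.Size([2, 16, 32, 32, 32])
```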
Subject(s)
Hippocampus, Three-Dimensional Imaging, Magnetic Resonance Imaging, Hippocampus/diagnostic imaging, Humans, Magnetic Resonance Imaging/methods, Three-Dimensional Imaging/methods, Neural Networks (Computer), Deep Learning, Algorithms
ABSTRACT
BACKGROUND: Oral Squamous Cell Carcinoma (OSCC) presents significant diagnostic challenges in its early and late stages. This study aims to use preoperative MRI and biochemical indicators of OSCC patients to predict tumor stage. METHODS: This study involved 198 patients from two medical centers. A detailed analysis of contrast-enhanced T1-weighted (ceT1W) and T2-weighted (T2W) MRI was conducted and integrated with biochemical indicators for a comprehensive evaluation. Initially, 42 clinical biochemical indicators were considered. Through univariate and multivariate analyses, only indicators with p-values less than 0.05 were retained for model development. To extract imaging features, machine learning algorithms were used in conjunction with Vision Transformer (ViT) techniques. These features were integrated with the biochemical indicators for predictive modeling. Model performance was evaluated using the Receiver Operating Characteristic (ROC) curve. RESULTS: After rigorous screening of the biochemical indicators, four key markers were selected for the model: cholesterol, triglyceride, very low-density lipoprotein cholesterol, and chloride. The model, developed using radiomics and deep learning for feature extraction from ceT1W and T2W images, achieved an Area Under the Curve (AUC) of 0.85 in the validation cohort when using these imaging modalities alone. Integrating the biochemical indicators improved the model's performance, increasing the validation-cohort AUC to 0.87. CONCLUSION: The performance of the model improved significantly after multimodal fusion, outperforming the single-modality approach. CLINICAL RELEVANCE STATEMENT: This integration of radiomics, ViT models, and lipid metabolite analysis presents a promising non-invasive technique for predicting OSCC staging.
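As an illustration of the fusion step described above, the following sketch combines imaging-derived features with the four biochemical indicators and scores the result with ROC AUC; the random features, the logistic-regression head, and the train/test split are placeholders, not the authors' pipeline.

```python
# Toy feature-level fusion of imaging features and biochemical indicators.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 198
imaging = rng.normal(size=(n, 32))     # stand-in for radiomics/ViT features from ceT1W + T2W
biochem = rng.normal(size=(n, 4))      # cholesterol, triglyceride, VLDL-C, chloride
y = rng.integers(0, 2, size=n)         # toy early- vs. late-stage labels

X = np.hstack([imaging, biochem])      # simple feature-level fusion
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(scaler.transform(X_te))[:, 1])
print(f"validation AUC: {auc:.3f}")
```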
Subject(s)
Mouth Neoplasms, Neoplasm Staging, Squamous Cell Carcinoma of Head and Neck, Adult, Aged, Female, Humans, Male, Middle Aged, Tumor Biomarkers, Lipids/blood, Machine Learning, Magnetic Resonance Imaging/methods, Mouth Neoplasms/diagnostic imaging, Mouth Neoplasms/pathology, Radiomics, ROC Curve, Squamous Cell Carcinoma of Head and Neck/diagnostic imaging, Squamous Cell Carcinoma of Head and Neck/pathology
ABSTRACT
PURPOSE: This study aimed to perform multimodal analysis with a vision transformer (vViT) to predict O6-methylguanine-DNA methyltransferase (MGMT) promoter status in adult patients with diffuse glioma, using demographics (sex and age), radiomic features, and MRI. METHODS: The training and test datasets contained 122 patients with 1,570 images and 30 patients with 484 images, respectively. Radiomic features were extracted from enhancing tumors (ET), necrotic tumor cores (NCR), and peritumoral edematous/infiltrated tissue (ED) using contrast-enhanced T1-weighted images (CE-T1WI) and T2-weighted images (T2WI). The vViT had nine sectors: one demographic sector, six radiomic sectors (CE-T1WI ET, CE-T1WI NCR, CE-T1WI ED, T2WI ET, T2WI NCR, and T2WI ED), and two image sectors (CE-T1WI and T2WI). Accuracy and area under the receiver-operating characteristic curve (AUC-ROC) were calculated for the test dataset. The performance of the vViT was compared with AlexNet, GoogleNet, VGG16, and ResNet using the McNemar and DeLong tests. Permutation importance (PI) analysis with the Mann-Whitney U test was performed. RESULTS: In the patient-based analysis, accuracy was 0.833 (95% confidence interval [95%CI]: 0.714-0.877) and the AUC-ROC was 0.840 (0.650-0.995). The vViT had higher accuracy than VGG16 and ResNet, and higher AUC-ROC than GoogleNet (p<0.05). The ED radiomic features extracted from T2WI demonstrated the highest importance (PI=0.239, 95%CI: 0.237-0.240) among all sectors (p<0.0001). CONCLUSION: The vViT is a competent deep learning model for predicting MGMT status, with the T2WI ED radiomic features making the most dominant contribution.
Subject(s)
Brain Neoplasms, Glioma, Guanine/analogs & derivatives, Adult, Humans, Brain Neoplasms/pathology, Radiomics, Glioma/pathology, Magnetic Resonance Imaging/methods, Demography, Retrospective Studies
ABSTRACT
BACKGROUND: The accurate detection of eyelid tumors is essential for effective treatment, but it can be challenging because lesions are small, unevenly distributed, and surrounded by irrelevant noise. Moreover, early symptoms of eyelid tumors are atypical, and some categories exhibit similar color and texture features, making it difficult to distinguish benign from malignant eyelid tumors, particularly for ophthalmologists with limited clinical experience. METHODS: We propose a hybrid model, HM_ADET, for automatic detection of eyelid tumors, comprising YOLOv7_CNFG to locate eyelid tumors and a vision transformer (ViT) to classify them as benign or malignant. First, the ConvNeXt module with an inverted bottleneck layer in the backbone of YOLOv7_CNFG is employed to prevent information loss for small eyelid tumors. Then, the flexible rectified linear unit (FReLU) is applied to capture multi-scale features such as texture, edge, and shape, thereby improving localization accuracy. In addition, considering the geometric center and area differences between the predicted box (PB) and the ground truth box (GT), the GIoU_loss is utilized to handle eyelid tumors with varying shapes and irregular boundaries. Finally, the multi-head attention (MHA) module is applied in the ViT to extract discriminative features of eyelid tumors for benign and malignant classification. RESULTS: Experimental results demonstrate that the HM_ADET model achieves excellent performance in detecting eyelid tumors. Specifically, YOLOv7_CNFG outperforms YOLOv7, with AP increasing from 0.763 to 0.893 on the internal test set and from 0.647 to 0.765 on the external test set. The ViT achieves AUCs of 0.945 (95% CI 0.894-0.981) and 0.915 (95% CI 0.860-0.955) for benign versus malignant classification on the internal and external test sets, respectively. CONCLUSIONS: Our study provides a promising strategy for the automatic diagnosis of eyelid tumors, which could potentially improve patient outcomes and reduce healthcare costs.
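The GIoU term mentioned above can be computed directly from box coordinates. The sketch below implements the standard GIoU loss for boxes in (x1, y1, x2, y2) format; how YOLOv7_CNFG weights this term against its other losses is not specified here.

```python
# Standard GIoU loss between predicted boxes (PB) and ground-truth boxes (GT).
import torch

def giou_loss(pb: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # Intersection area.
    x1 = torch.max(pb[..., 0], gt[..., 0])
    y1 = torch.max(pb[..., 1], gt[..., 1])
    x2 = torch.min(pb[..., 2], gt[..., 2])
    y2 = torch.min(pb[..., 3], gt[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_pb = (pb[..., 2] - pb[..., 0]) * (pb[..., 3] - pb[..., 1])
    area_gt = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    union = area_pb + area_gt - inter
    iou = inter / union.clamp(min=1e-7)

    # Smallest enclosing box penalizes predictions far from the ground truth.
    ex1 = torch.min(pb[..., 0], gt[..., 0])
    ey1 = torch.min(pb[..., 1], gt[..., 1])
    ex2 = torch.max(pb[..., 2], gt[..., 2])
    ey2 = torch.max(pb[..., 3], gt[..., 3])
    enclose = ((ex2 - ex1) * (ey2 - ey1)).clamp(min=1e-7)

    giou = iou - (enclose - union) / enclose
    return (1.0 - giou).mean()

print(giou_loss(torch.tensor([[10., 10., 50., 60.]]),
                torch.tensor([[12., 8., 48., 62.]])))
```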
Subject(s)
Eyelid Neoplasms, Humans, Eyelid Neoplasms/diagnosis, Area Under Curve, Health Care Costs
ABSTRACT
Renal tumors are among the common diseases of urology, and precise segmentation of these tumors plays a crucial role in helping physicians improve diagnostic accuracy and treatment effectiveness. Nevertheless, owing to inherent challenges of renal tumors, such as indistinct boundaries, morphological variations, and uncertainty in size and location, segmenting them accurately remains a significant challenge in medical image segmentation. With the development of deep learning, substantial achievements have been made in this domain. However, existing models lack specificity in extracting renal tumor features across different network hierarchies, which results in insufficient feature extraction and ultimately affects segmentation accuracy. To address this issue, we propose the Selective Kernel, Vision Transformer, and Coordinate Attention Enhanced U-Net (STC-UNet). This model aims to enhance feature extraction by adapting to the distinctive characteristics of renal tumors at various network levels. Specifically, Selective Kernel modules are introduced in the shallow layers of the U-Net, where detailed features are more abundant. By selectively employing convolutional kernels of different scales, the model enhances its ability to extract detailed renal tumor features at multiple scales. Subsequently, in the deeper layers of the network, where feature maps are smaller yet semantically rich, Vision Transformer modules are integrated in a non-patch manner. These help the model capture long-range contextual information globally, and their non-patch implementation facilitates the capture of fine-grained features, achieving collaborative enhancement of global and local information and ultimately strengthening the extraction of semantic renal tumor features. Finally, in the decoder, Coordinate Attention modules that embed positional information are introduced to enhance the model's feature recovery and tumor-region localization capabilities. Our model is validated on the KiTS19 dataset, and experimental results indicate that, compared with the baseline model, STC-UNet improves IoU, Dice, Accuracy, Precision, Recall, and F1-score by 1.60%, 2.02%, 2.27%, 1.18%, 1.52%, and 1.35%, respectively. Furthermore, the experimental results demonstrate that the proposed STC-UNet surpasses other advanced algorithms in both visual quality and objective evaluation metrics.
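A minimal sketch of a two-branch Selective Kernel block in the spirit of SKNet, offered as one plausible reading of the Selective Kernel modules placed in STC-UNet's shallow layers; the branch kernels (3 and 5) and the reduction ratio are assumptions.

```python
# SKNet-style selective kernel block: branch weights are predicted per channel.
import torch
import torch.nn as nn

class SelectiveKernel2D(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv5 = nn.Conv2d(channels, channels, 5, padding=2)
        hidden = max(channels // reduction, 8)
        self.reduce = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
        self.select = nn.Linear(hidden, 2 * channels)   # per-channel logits for 2 branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        u3, u5 = self.conv3(x), self.conv5(x)
        s = (u3 + u5).mean(dim=(2, 3))                  # global descriptor (B, C)
        z = self.select(self.reduce(s)).view(b, 2, c)
        w = torch.softmax(z, dim=1)                     # softmax over the two branches
        w3, w5 = w[:, 0].view(b, c, 1, 1), w[:, 1].view(b, c, 1, 1)
        return u3 * w3 + u5 * w5

out = SelectiveKernel2D(32)(torch.randn(1, 32, 64, 64))
print(out.shape)  # torch.Size([1, 32, 64, 64])
```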
Subject(s)
Deep Learning, Kidney Neoplasms, Humans, Kidney Neoplasms/diagnostic imaging, Kidney Neoplasms/pathology, Kidney Neoplasms/surgery, Algorithms, Neural Networks (Computer), X-Ray Computed Tomography/methods, Computer-Assisted Image Interpretation/methods
ABSTRACT
BACKGROUND: Skin cancer is one of the most common diseases in humans. Early detection and treatment are essential for reducing its severity. Deep learning techniques are supplementary tools that assist clinical experts in detecting and localizing skin lesions. Vision transformer (ViT)-based multiclass classification of segmented lesion images provides fairly accurate detection and is gaining popularity owing to its multiclass prediction capabilities. MATERIALS AND METHODS: In this research, we propose a new ViT Gradient-Weighted Class Activation Mapping (GradCAM)-based architecture, named ViT-GradCAM, for detecting and classifying skin lesions by the spreading ratio on the lesion's surface area. The proposed system is trained and validated on the HAM10000 dataset, which covers seven types of skin lesions and comprises 10,015 dermatoscopic images of varied sizes. Data preprocessing and augmentation techniques are applied to overcome class imbalance and improve the model's performance. RESULT: The proposed algorithm classifies the dermatoscopic images into seven classes with an accuracy of 97.28%, precision of 98.51%, recall of 95.2%, and an F1 score of 94.6%. ViT-GradCAM achieves more accurate detection and classification than other state-of-the-art deep learning-based skin lesion detection models. Its outputs are extensively visualized to highlight the pixels in the regions most associated with skin-specific pathologies. CONCLUSION: This research proposes an alternative approach to detecting and classifying skin lesions that combines ViTs with GradCAM, rather than relying solely on opaque deep learning models.
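A hedged sketch of Grad-CAM applied to ViT patch tokens: class-score gradients provide per-channel weights, and the weighted token activations are reshaped into a patch-grid heatmap. The toy encoder below only stands in for the actual ViT-GradCAM architecture.

```python
# Grad-CAM over ViT patch tokens, demonstrated with a toy patch-embedding encoder.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=64, classes=7):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, classes)
        self.grid = img // patch

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2)   # (B, N, D) patch tokens
        tokens = self.encoder(tokens)
        tokens.retain_grad()                                 # keep gradients for Grad-CAM
        self._tokens = tokens
        return self.head(tokens.mean(dim=1))

model = TinyViT()
x = torch.randn(1, 3, 224, 224)
logits = model(x)
logits[0, logits.argmax()].backward()                        # gradient of the top class

tokens, grads = model._tokens[0], model._tokens.grad[0]      # (N, D) each
weights = grads.mean(dim=0)                                   # one weight per channel
cam = torch.relu(tokens @ weights).reshape(model.grid, model.grid)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalized 14x14 heatmap
print(cam.shape)
```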
Asunto(s)
Algoritmos , Aprendizaje Profundo , Dermoscopía , Neoplasias Cutáneas , Humanos , Dermoscopía/métodos , Neoplasias Cutáneas/diagnóstico por imagen , Neoplasias Cutáneas/clasificación , Neoplasias Cutáneas/patología , Interpretación de Imagen Asistida por Computador/métodos , Bases de Datos Factuales , Piel/diagnóstico por imagen , Piel/patologíaRESUMEN
Predicting human gaze from egocentric videos plays a critical role in understanding human intention in daily activities. In this paper, we present the first transformer-based model to address the challenging problem of egocentric gaze estimation. We observe that the connection between the global scene context and local visual information is vital for localizing the gaze fixation from egocentric video frames. To this end, we design the transformer encoder to embed the global context as one additional visual token and further propose a novel global-local correlation module to explicitly model the correlation between the global token and each local token. We validate our model on two egocentric video datasets - EGTEA Gaze+ and Ego4D. Our detailed ablation studies demonstrate the benefits of our method, and our approach exceeds the previous state-of-the-art model by a large margin. We also apply our model to a novel gaze saccade/fixation prediction task and the traditional action recognition problem; the consistent gains suggest the strong generalization capability of our model. We also provide additional visualizations to support our claim that global-local correlation serves as a key representation for predicting gaze fixation from egocentric videos. More details can be found on our project website (https://bolinlai.github.io/GLC-EgoGazeEst).
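A minimal sketch of the global-local idea: a pooled global token is correlated with every local token, and the correlation reweights the local features. The mean-pooled global token and cosine correlation are simplifying assumptions, not the paper's exact GLC module.

```python
# Simplified global-local correlation: globally consistent tokens are emphasized.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalCorrelation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj_g = nn.Linear(dim, dim)
        self.proj_l = nn.Linear(dim, dim)

    def forward(self, local_tokens: torch.Tensor) -> torch.Tensor:
        # local_tokens: (B, N, D) features from egocentric video frames.
        global_token = local_tokens.mean(dim=1, keepdim=True)        # (B, 1, D)
        g = F.normalize(self.proj_g(global_token), dim=-1)
        l = F.normalize(self.proj_l(local_tokens), dim=-1)
        corr = (l * g).sum(dim=-1, keepdim=True)                     # (B, N, 1) cosine correlation
        return local_tokens * (1.0 + corr)                           # reweight local features

tokens = torch.randn(2, 196, 256)
print(GlobalLocalCorrelation(256)(tokens).shape)  # torch.Size([2, 196, 256])
```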
ABSTRACT
BACKGROUND: Recent advances in Vision Transformer (ViT)-based deep learning have significantly improved the accuracy of lung disease prediction from chest X-ray images. However, little research has compared the effectiveness of different optimizers for lung disease prediction within ViT models. This study systematically evaluates and compares the performance of various optimization methods for ViT-based models in predicting lung diseases from chest X-ray images. METHODS: This study used a chest X-ray dataset comprising 19,003 images covering normal cases and six lung diseases: COVID-19, Viral Pneumonia, Bacterial Pneumonia, Middle East Respiratory Syndrome (MERS), Severe Acute Respiratory Syndrome (SARS), and Tuberculosis. Each ViT model (ViT, FastViT, and CrossViT) was trained individually with each optimization method (Adam, AdamW, NAdam, RAdam, SGDW, and Momentum) to assess its performance in lung disease prediction. RESULTS: When tested with ViT on the dataset with balanced class sizes, RAdam demonstrated superior accuracy compared with the other optimizers, achieving 95.87%. On the dataset with imbalanced class sizes, FastViT with NAdam achieved the best performance, with an accuracy of 97.63%. CONCLUSIONS: We provide comprehensive optimization strategies for developing ViT-based model architectures, which can enhance the performance of these models for lung disease prediction from chest X-ray images.
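A small sketch of how such an optimizer comparison could be wired up in PyTorch; torchvision's vit_b_16 stands in for the evaluated models, SGD with momentum approximates the "Momentum" setting, and decoupled SGDW is omitted because torch.optim does not provide it.

```python
# Swapping optimizers for a ViT-style 7-class chest X-ray classifier.
import torch
from torchvision.models import vit_b_16

def make_optimizer(name, params, lr=3e-4):
    params = list(params)                       # allow reuse of a parameter generator
    factories = {
        "adam":     lambda: torch.optim.Adam(params, lr=lr),
        "adamw":    lambda: torch.optim.AdamW(params, lr=lr, weight_decay=0.05),
        "nadam":    lambda: torch.optim.NAdam(params, lr=lr),
        "radam":    lambda: torch.optim.RAdam(params, lr=lr),
        "momentum": lambda: torch.optim.SGD(params, lr=lr, momentum=0.9),
    }
    return factories[name]()

model = vit_b_16(num_classes=7)                 # 7 outputs: normal + six lung diseases
optimizer = make_optimizer("radam", model.parameters())
criterion = torch.nn.CrossEntropyLoss()

# One toy training step on random data to show the loop structure.
images, labels = torch.randn(4, 3, 224, 224), torch.randint(0, 7, (4,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.3f}")
```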
Subject(s)
Deep Learning, Lung Diseases, Humans, Lung Diseases/diagnostic imaging, Thoracic Radiography/methods, Thoracic Radiography/standards, COVID-19/diagnostic imaging
ABSTRACT
BACKGROUND: Histopathology is a gold standard for cancer diagnosis. It involves extracting tissue specimens from suspicious areas to prepare a glass slide for microscopic examination. However, histological tissue processing introduces artifacts, which are ultimately transferred to the digitized versions of glass slides, known as whole slide images (WSIs). Artifacts are diagnostically irrelevant areas and may lead to wrong predictions from deep learning (DL) algorithms. Therefore, detecting and excluding artifacts in the computational pathology (CPATH) system is essential for reliable automated diagnosis. METHODS: In this paper, we propose a mixture-of-experts (MoE) scheme for detecting five notable artifacts in WSIs: damaged tissue, blur, folded tissue, air bubbles, and histologically irrelevant blood. First, we train independent binary DL models as experts to capture particular artifact morphologies. Then, we ensemble their predictions using a fusion mechanism and apply probabilistic thresholding over the final probability distribution to improve the sensitivity of the MoE. We developed four DL pipelines to evaluate computational and performance trade-offs: two MoEs and two multiclass models built on state-of-the-art deep convolutional neural networks (DCNNs) and vision transformers (ViTs). These pipelines are quantitatively and qualitatively evaluated on external and out-of-distribution (OoD) data to assess generalizability and robustness for artifact detection. RESULTS: We extensively evaluated the proposed MoE and multiclass models. The DCNN-based and ViT-based MoE schemes outperformed the simpler multiclass models when tested on datasets from different hospitals and cancer types, with the MoE using MobileNet DCNNs yielding the best results. The proposed MoE achieves 86.15% F1 and 97.93% sensitivity on unseen data, at a lower inference cost than the MoE using ViTs. This superior MoE performance comes with a relatively higher computational cost than the multiclass models. Furthermore, we apply post-processing to create an artifact segmentation mask, a potential artifact-free RoI map, a quality report, and an artifact-refined WSI for further computational analysis. In the qualitative evaluation, field experts assessed the predictive performance of the MoEs on OoD WSIs. They rated artifact detection and artifact-free area preservation, where the highest agreement translated to a Cohen's kappa of 0.82, indicating substantial agreement on the overall diagnostic usability of the DCNN-based MoE scheme. CONCLUSIONS: The proposed artifact detection pipeline will not only ensure reliable CPATH predictions but may also provide quality control. In this work, the best-performing pipeline for artifact detection is the MoE with DCNNs. Our detailed experiments show that there is always a trade-off between performance and computational complexity, and no single DL solution suits all types of data and applications equally well. The code and the HistoArtifacts dataset can be found online at GitHub and Zenodo, respectively.
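A minimal sketch of the expert-fusion idea: each binary artifact expert reports a probability, per-expert thresholds below 0.5 favor sensitivity, and the most confident flagged expert labels the patch. The expert names mirror the five artifacts above, but the thresholds and tie-breaking rule are illustrative assumptions.

```python
# Toy fusion of independent binary artifact experts with sensitivity-oriented thresholds.
EXPERTS = ["damaged_tissue", "blur", "folded_tissue", "air_bubble", "blood"]
THRESHOLDS = {name: 0.35 for name in EXPERTS}      # below 0.5 to favor sensitivity

def fuse_experts(patch_probs):
    """patch_probs: dict mapping each expert to its artifact probability for one WSI patch."""
    flagged = {n: p for n, p in patch_probs.items() if p >= THRESHOLDS[n]}
    if not flagged:
        return False, None                          # patch kept as diagnostically usable
    return True, max(flagged, key=flagged.get)      # most confident artifact class wins

probs = {"damaged_tissue": 0.10, "blur": 0.62, "folded_tissue": 0.20,
         "air_bubble": 0.05, "blood": 0.41}
print(fuse_experts(probs))                          # -> (True, 'blur')
```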
Subject(s)
Artifacts, Deep Learning, Humans, Neoplasms, Computer-Assisted Image Processing/methods, Clinical Pathology/standards, Computer-Assisted Image Interpretation/methods
ABSTRACT
BACKGROUND: Maxillary expansion is an important treatment for maxillary transverse hypoplasia. The choice of expansion method depends on the maturation stage of the midpalatal suture, which orthodontists conventionally assess from palatal-plane cone beam computed tomography (CBCT) images, a process with low efficiency and strong subjectivity. This study develops and evaluates an enhanced vision transformer (ViT) to automatically classify CBCT images of midpalatal sutures at different maturation stages. METHODS: In recent years, the use of convolutional neural networks (CNNs) to classify images of the midpalatal suture at different maturation stages has supported clinical decisions about the maxillary expansion method. However, CNNs cannot adequately learn the long-distance dependencies that are also required for global recognition of midpalatal suture CBCT images. The self-attention mechanism of the ViT can capture relationships between distant pixels, but it lacks the inductive biases of CNNs and requires more training data. To solve this problem, a CNN-enhanced ViT model based on transfer learning is proposed to classify midpalatal suture CBCT images. In this study, 2518 CBCT images of the palatal plane are collected and divided into a training set of 1259 images, a validation set of 506 images, and a test set of 753 images. After preprocessing the training images, the CNN-enhanced ViT model is trained and tuned, and its generalization ability is evaluated on the test set. RESULTS: The classification accuracy of our proposed ViT model is 95.75%, and its macro-averaged area under the receiver operating characteristic curve (AUC) and micro-averaged AUC are 97.89% and 98.36%, respectively, on our test set. The best-performing CNN model, EfficientnetV2_S, achieved a classification accuracy of 93.76%, and the clinician achieved 89.10% on the same test set. CONCLUSIONS: The experimental results show that this method can effectively classify CBCT images of midpalatal suture maturation stages and performs better than a clinician. The model can therefore provide a valuable reference for orthodontists and assist them in making a correct diagnosis.
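A minimal sketch of the CNN-enhanced ViT idea: a small convolutional stem contributes the missing inductive bias and emits patch tokens for a Transformer encoder, followed by a classification head. The layer sizes, mean-pooled head, and three-class output are assumptions; the actual model and its transfer-learning weights are not reproduced here.

```python
# Hypothetical CNN stem feeding a Transformer encoder for CBCT slice classification.
import torch
import torch.nn as nn

class CNNEnhancedViT(nn.Module):
    def __init__(self, dim: int = 128, num_classes: int = 3):
        super().__init__()
        self.stem = nn.Sequential(                       # CNN stem: 224x224 -> 14x14 feature map
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, dim, 3, stride=4, padding=1),
        )
        self.pos = nn.Parameter(torch.zeros(1, 196, dim))  # learned positions for 14x14 tokens
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.stem(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.encoder(tokens + self.pos)
        return self.head(tokens.mean(dim=1))              # mean-pooled classification

logits = CNNEnhancedViT()(torch.randn(2, 1, 224, 224))    # grayscale palatal-plane slices
print(logits.shape)  # torch.Size([2, 3])
```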
Subject(s)
Cone-Beam Computed Tomography, Neural Networks (Computer), Humans, Cranial Sutures/diagnostic imaging, Palatal Expansion Technique, Palate/diagnostic imaging, Machine Learning
ABSTRACT
PURPOSE: To study the effectiveness of whole-scenario embryo identification using a self-supervised learning encoder (WISE) in in vitro fertilization (IVF) across time-lapse, cross-device, and cryo-thawed scenarios. METHODS: WISE was based on the vision transformer (ViT) architecture and masked autoencoders (MAE), a self-supervised learning (SSL) method. To train WISE, we prepared three datasets: the SSL pre-training dataset, the time-lapse identification dataset, and the cross-device identification dataset. To determine whether pairs of images came from the same embryo in the different downstream identification scenarios, embryo images, including time-lapse and microscope images, were first pre-processed through object detection, cropping, padding, and resizing, and then fed into WISE to obtain predictions. RESULTS: WISE accurately identified embryos in all three scenarios, with an accuracy of 99.89% on the time-lapse identification dataset and 83.55% on the cross-device identification dataset. In addition, we subdivided a cryo-thawed evaluation set from the cross-device test set to better estimate how WISE performs in real-world settings, where it reached an accuracy of 82.22%. Applying the SSL method brought improvements of approximately 10% on the cross-device and cryo-thawed identification tasks, and WISE outperformed embryologists by 9.5%, 12%, and 18% in accuracy across the three scenarios. CONCLUSION: SSL methods can improve embryo identification accuracy even when dealing with cross-device and cryo-thawed image pairs. This study is the first to apply SSL to embryo identification, and the results show the promise of WISE for future application in embryo witnessing.
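A minimal sketch of the downstream pair-identification step, assuming WISE reduces to comparing encoder embeddings of two pre-processed images with a cosine-similarity threshold; the embedding size and threshold are illustrative.

```python
# Pairwise embryo identification from encoder embeddings via cosine similarity.
import torch
import torch.nn.functional as F

def same_embryo(emb_a: torch.Tensor, emb_b: torch.Tensor, threshold: float = 0.8) -> bool:
    """emb_a, emb_b: (D,) embeddings from the SSL-pre-trained ViT encoder."""
    sim = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
    return sim >= threshold

# Toy example with random vectors standing in for time-lapse vs. microscope embeddings.
a = torch.randn(256)
print(same_embryo(a, a + 0.05 * torch.randn(256)))  # near-duplicate -> likely True
print(same_embryo(a, torch.randn(256)))             # unrelated -> likely False
```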
Subject(s)
In Vitro Fertilization, Time-Lapse Imaging, Humans, In Vitro Fertilization/methods, Female, Time-Lapse Imaging/methods, Supervised Machine Learning, Mammalian Embryo, Pregnancy, Computer-Assisted Image Processing/methods, Blastocyst/cytology, Blastocyst/physiology, Embryo Transfer/methods, Cryopreservation/methods
ABSTRACT
The purpose of this study is to establish a deep learning-based computer-aided diagnosis system for classifying mediastinal lesions in endobronchial ultrasound (EBUS) images as benign or malignant. EBUS images are acquired as video and contain multiple imaging modes; different modes and different frames reflect different characteristics of the lesions. Compared with previous studies, the proposed model efficiently extracts and integrates the spatiotemporal relationships between modes and does not require manual selection of representative frames. In recent years, the Vision Transformer has received much attention in computer vision, and combined with convolutional neural networks, hybrid transformers can also perform well on small datasets. This study designed a novel hybrid-transformer-based deep learning architecture called TransEBUS. By adding learnable parameters in the temporal dimension, TransEBUS is able to extract spatiotemporal features from limited data. In addition, we designed a two-stream module to integrate information from the three imaging modes of EBUS, and we applied contrastive learning when training TransEBUS, enabling it to learn discriminative representations of benign and malignant mediastinal lesions. The results show that TransEBUS achieved a diagnostic accuracy of 82% and an area under the curve of 0.8812 on the test dataset, outperforming other methods, and that several models improve when the two-stream module is incorporated. Our proposed system has shown its potential to help physicians distinguish benign from malignant mediastinal lesions, thereby ensuring the accuracy of EBUS examination.
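A hedged sketch of one way to realize the contrastive training mentioned above: a supervised contrastive loss that pulls together embeddings sharing a benign/malignant label and pushes apart the rest; the actual TransEBUS loss may be formulated differently.

```python
# Supervised contrastive loss over lesion embeddings (one plausible formulation).
import torch
import torch.nn.functional as F

def sup_contrastive_loss(features: torch.Tensor, labels: torch.Tensor, tau: float = 0.1):
    """features: (B, D) embeddings; labels: (B,) with 0 = benign, 1 = malignant."""
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / tau                                    # (B, B) similarity logits
    same = labels.unsqueeze(0) == labels.unsqueeze(1)        # positives share a label
    eye = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    pos_mask = same & ~eye

    # Log-softmax over all other samples, then average over each anchor's positives.
    sim = sim.masked_fill(eye, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return -(pos_log_prob.sum(dim=1) / pos_count).mean()

feats = torch.randn(8, 128, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 0, 1, 1, 0])
print(sup_contrastive_loss(feats, labels))
```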
ABSTRACT
Human-object interaction (HOI) detection identifies a "set of interactions" in an image, involving the recognition of interacting instances and the classification of interaction categories. The complexity and variety of image content make this task challenging. Recently, the Transformer has been applied in computer vision and has received attention in the HOI detection task. This paper therefore proposes a novel Part Refinement Tandem Transformer (PRTT) for HOI detection. Unlike previous Transformer-based HOI methods, PRTT utilizes multiple decoders to split and process the rich elements of HOI prediction and introduces a new part state feature extraction (PSFE) module to improve the final interaction category classification. We adopt a novel prior-feature-integrated cross-attention (PFIC) module that uses the fine-grained part-state semantic and appearance features produced by the PSFE module to guide the queries. We validate our method on two public datasets, V-COCO and HICO-DET. Compared with state-of-the-art models, PRTT significantly improves human-object interaction detection performance.
ABSTRACT
In online video understanding, which has a wide range of real-world applications, inference speed is crucial. Many approaches involve frame-level visual feature extraction, which often represents the biggest bottleneck. We propose RetinaViT, an efficient method for extracting frame-level visual features in an online video stream, aiming to fundamentally enhance the efficiency of online video understanding tasks. RetinaViT is composed of efficiently approximated Transformer blocks that only take changed tokens (event tokens) as queries and reuse the already processed tokens from the previous timestep for the others. Furthermore, we restrict keys and values to the spatial neighborhoods of event tokens to further improve efficiency. RetinaViT involves tuning multiple parameters, which we determine through a multi-step process. During model training, we randomly vary these parameters and then perform black-box optimization to maximize accuracy and efficiency on the pre-trained model. We conducted extensive experiments on various online video recognition tasks, including action recognition, pose estimation, and object segmentation, validating the effectiveness of each component in RetinaViT and demonstrating improvements in the speed/accuracy trade-off compared to baselines. In particular, for action recognition, RetinaViT built on ViT-B16 reduces inference time by approximately 61.9% on the CPU and 50.8% on the GPU, while achieving slight accuracy improvements rather than degradation.
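A minimal sketch of the event-token idea: tokens whose patch content changed beyond a threshold since the previous frame are recomputed, while the rest reuse cached outputs. Thresholding raw patch differences and reusing a single cached layer output are simplifications of RetinaViT.

```python
# Selective token refresh between consecutive video frames.
import torch

def update_tokens(prev_patches, curr_patches, cached_out, block, threshold=0.1):
    """prev/curr_patches, cached_out: (N, D); block: callable token processor."""
    change = (curr_patches - prev_patches).abs().mean(dim=1)     # per-token change score
    event = change > threshold                                   # event tokens to refresh
    out = cached_out.clone()
    if event.any():
        out[event] = block(curr_patches[event].unsqueeze(0)).squeeze(0)
    return out, event

# Toy demo: a linear layer stands in for a Transformer block.
block = torch.nn.Linear(64, 64)
prev = torch.randn(196, 64)
curr = prev.clone()
curr[:10] += 1.0                                                 # only 10 patches changed
with torch.no_grad():
    cached = block(prev.unsqueeze(0)).squeeze(0)
    out, event = update_tokens(prev, curr, cached, block)
print(int(event.sum()))                                          # -> 10 refreshed tokens
```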
ABSTRACT
Automated segmentation algorithms for dermoscopic images serve as effective tools that assist dermatologists in clinical diagnosis. While existing deep learning-based skin lesion segmentation algorithms have achieved certain success, challenges remain in accurately delineating the boundaries of lesion regions in dermoscopic images with irregular shapes, blurry edges, and occlusions by artifacts. To address these issues, a multi-attention codec network with selective and dynamic fusion (MASDF-Net) is proposed for skin lesion segmentation in this study. In this network, we use the pyramid vision transformer as the encoder to model the long-range dependencies between features, and we innovatively designed three modules to further enhance the performance of the network. Specifically, the multi-attention fusion (MAF) module allows for attention to be focused on high-level features from various perspectives, thereby capturing more global contextual information. The selective information gathering (SIG) module improves the existing skip-connection structure by eliminating the redundant information in low-level features. The multi-scale cascade fusion (MSCF) module dynamically fuses features from different levels of the decoder part, further refining the segmentation boundaries. We conducted comprehensive experiments on the ISIC 2016, ISIC 2017, ISIC 2018, and PH2 datasets. The experimental results demonstrate the superiority of our approach over existing state-of-the-art methods.
Subject(s)
Algorithms, Neural Networks (Computer), Humans, Deep Learning, Dermoscopy/methods, Computer-Assisted Image Processing/methods, Skin/diagnostic imaging, Skin/pathology, Computer-Assisted Image Interpretation/methods
ABSTRACT
Visual transformers (ViTs) are widely used in various visual tasks, such as fine-grained visual classification (FGVC). However, the self-attention mechanism, the core module of visual transformers, has quadratic computational and memory complexity. The sparse-attention and local-attention approaches currently used by most researchers are not suitable for FGVC tasks, which require dense feature extraction and global dependency modeling. To address this challenge, we propose a dual-dependency attention transformer model that decouples global token interactions into two paths: a position-dependency attention pathway based on the intersection of two types of grouped attention, and a semantic-dependency attention pathway based on dynamic central aggregation. This approach enhances high-quality semantic modeling of discriminative cues while reducing the computational cost to linear complexity. In addition, we develop discriminative enhancement strategies that increase the sensitivity of high-confidence discriminative cue tracking through a knowledge-based representation approach. Experiments on three datasets, NABIRDS, CUB, and DOGS, show that the method is well suited to fine-grained image classification and strikes a balance between computational cost and performance.
ABSTRACT
Existing industrial image anomaly detection techniques predominantly rely on encoder-decoders (codecs) based on convolutional neural networks (CNNs). However, traditional convolutional autoencoders are limited to local features and struggle to assimilate global information, and the generalization ability of CNNs means that some anomalous regions are reconstructed as well. This is particularly evident when normal and abnormal regions have similar pixel values but different semantic content, leading to ineffective anomaly detection. Furthermore, collecting abnormal image samples during actual industrial production is difficult, often resulting in data imbalance. To mitigate these issues, this study proposes an unsupervised anomaly detection model built on the Vision Transformer (ViT) architecture, whose Transformer structure captures the global context among image patches and thereby extracts a superior feature representation. It integrates a memory module that catalogs normal-sample features, both to counteract the reconstruction of anomalies and to strengthen the feature representation, and it additionally introduces a coordinate attention (CA) mechanism that intensifies the focus on image features along both the spatial and channel dimensions, minimizing information loss and enabling more precise anomaly identification and localization. Experiments on two public datasets, MVTec AD and BeanTech AD, substantiate the method's effectiveness, demonstrating an improvement of approximately 20% in average image-level AUROC over traditional convolutional encoders.
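A minimal sketch of a memory module that re-expresses encoder features as soft combinations of learned "normal" prototypes via cosine-similarity addressing, in the spirit of the memory mechanism described above; the memory size and addressing scheme are assumptions.

```python
# Memory-based reconstruction of ViT patch features from normal prototypes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryModule(nn.Module):
    def __init__(self, num_items: int = 100, dim: int = 256):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_items, dim))   # normal-feature prototypes

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, N, D) patch features from the ViT encoder.
        attn = F.softmax(F.normalize(z, dim=-1) @ F.normalize(self.memory, dim=-1).t(), dim=-1)
        return attn @ self.memory          # each feature re-expressed from normal prototypes

mem = MemoryModule()
z = torch.randn(2, 196, 256)
z_hat = mem(z)                             # anomalous features reconstruct poorly
print(z_hat.shape)                         # torch.Size([2, 196, 256])
```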
ABSTRACT
Most single-object trackers currently employ either a convolutional neural network (CNN) or a vision transformer as the backbone for object tracking. In CNNs, convolutional operations excel at extracting local features but struggle to capture global representations, whereas vision transformers use cascaded self-attention modules to capture long-range feature dependencies but may overlook local feature details. To address these limitations, we propose a target-tracking algorithm called CVTrack, which leverages a parallel dual-branch backbone network combining a CNN and a Transformer for feature extraction and fusion. First, CVTrack uses the parallel CNN and transformer branches to extract local and global features from the input image, and bidirectional information interaction channels allow the local features from the CNN branch and the global features from the transformer branch to interact and fuse effectively. Second, deep cross-correlation operations and transformer-based methods are employed to fuse the template and search-region features, enabling comprehensive interaction between them. The fused features are then fed into the prediction module to accomplish the object-tracking task. Our tracker achieves state-of-the-art performance on five benchmark datasets while maintaining real-time execution speed. Finally, we conduct ablation studies to demonstrate the efficacy of each module in the parallel dual-branch feature extraction backbone.
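A minimal sketch of the depthwise cross-correlation commonly used to match template and search-region features in Siamese-style trackers; CVTrack's surrounding fusion and prediction modules are not shown.

```python
# Depthwise cross-correlation between template and search-region features.
import torch
import torch.nn.functional as F

def depthwise_xcorr(search: torch.Tensor, template: torch.Tensor) -> torch.Tensor:
    """search: (B, C, Hs, Ws), template: (B, C, Ht, Wt) -> response (B, C, Ho, Wo)."""
    b, c, h, w = search.shape
    x = search.reshape(1, b * c, h, w)                  # fold batch into channels
    kernel = template.reshape(b * c, 1, *template.shape[2:])
    resp = F.conv2d(x, kernel, groups=b * c)            # per-channel correlation
    return resp.reshape(b, c, *resp.shape[2:])

search = torch.randn(2, 64, 31, 31)     # search-region features
template = torch.randn(2, 64, 7, 7)     # template (target) features
print(depthwise_xcorr(search, template).shape)  # torch.Size([2, 64, 25, 25])
```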
ABSTRACT
Visual object tracking, pivotal for applications like earth observation and environmental monitoring, encounters challenges under adverse conditions such as low light and complex backgrounds. Traditional tracking technologies often falter, especially when tracking dynamic objects like aircraft amidst rapid movements and environmental disturbances. This study introduces an innovative adaptive multimodal image object-tracking model that harnesses the capabilities of multispectral image sensors, combining infrared and visible light imagery to significantly enhance tracking accuracy and robustness. By employing the advanced vision transformer architecture and integrating token spatial filtering (TSF) and crossmodal compensation (CMC), our model dynamically adjusts to diverse tracking scenarios. Comprehensive experiments conducted on a private dataset and various public datasets demonstrate the model's superior performance under extreme conditions, affirming its adaptability to rapid environmental changes and sensor limitations. This research not only advances visual tracking technology but also offers extensive insights into multisource image fusion and adaptive tracking strategies, establishing a robust foundation for future enhancements in sensor-based tracking systems.
ABSTRACT
Fault diagnosis is one of the important applications of edge computing in the Industrial Internet of Things (IIoT). To address the issue that traditional fault diagnosis methods often struggle to effectively extract fault features, this paper proposes a novel rolling bearing fault diagnosis method that integrates Gramian Angular Field (GAF), Convolutional Neural Network (CNN), and Vision Transformer (ViT). First, GAF is used to convert one-dimensional vibration signals from sensors into two-dimensional images, effectively retaining the fault features of the vibration signal. Then, the CNN branch is used to extract the local features of the image, which are combined with the global features extracted by the ViT branch to diagnose the bearing fault. The effectiveness of this method is validated with two datasets. Experimental results show that the proposed method achieves average accuracies of 99.79% and 99.63% on the CWRU and XJTU-SY rolling bearing fault datasets, respectively. Compared with several widely used fault diagnosis methods, the proposed method achieves higher accuracy for different fault classifications, providing reliable technical support for performing complex fault diagnosis on edge devices.
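A minimal sketch of the Gramian Angular (summation) Field transform that turns a 1-D vibration segment into a 2-D image; the CNN and ViT branches that consume the image are not shown, and the GASF variant is an assumption (a GADF variant differs only in the trigonometric combination).

```python
# Gramian Angular Summation Field (GASF) for a 1-D vibration segment.
import numpy as np

def gramian_angular_field(signal: np.ndarray) -> np.ndarray:
    """signal: 1-D array -> (len, len) GASF image with values in [-1, 1]."""
    x = np.asarray(signal, dtype=float)
    # Rescale to [-1, 1] so the polar-angle encoding is well defined.
    x = 2.0 * (x - x.min()) / (x.max() - x.min() + 1e-12) - 1.0
    x = np.clip(x, -1.0, 1.0)
    # GASF(i, j) = cos(phi_i + phi_j) = x_i * x_j - sqrt(1 - x_i^2) * sqrt(1 - x_j^2)
    comp = np.sqrt(1.0 - x ** 2)
    return np.outer(x, x) - np.outer(comp, comp)

segment = np.sin(np.linspace(0, 8 * np.pi, 128)) + 0.1 * np.random.randn(128)
image = gramian_angular_field(segment)
print(image.shape)  # (128, 128) image ready for the CNN/ViT branches
```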