ABSTRACT
Urolithiasis is a leading urological disorder in which accurate preoperative identification of stone types is critical for effective treatment. Deep learning has shown promise in classifying urolithiasis from CT images, yet faces challenges with model size and computational efficiency in real clinical settings. To address these challenges, we developed a non-invasive approach for predicting urinary stone types from CT images. By refining the self-distillation architecture and incorporating feature fusion and a Coordinate Attention Module (CAM), we achieve more effective and thorough knowledge transfer. This method avoids the extra computational expense and performance loss associated with model compression and removes the reliance on external teacher models, markedly enhancing the efficacy of lightweight models. Our approach achieved a classification accuracy of 74.96% on a proprietary dataset, outperforming current techniques. Furthermore, it demonstrated superior performance and generalizability on two public datasets. This not only validates the effectiveness of our approach for classifying urinary stones but also showcases its potential in other medical image processing tasks. These results further support the feasibility of deploying our model in actual clinical settings, potentially assisting healthcare professionals in devising more precise treatment plans and reducing patient discomfort.
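For illustration, a minimal PyTorch sketch of a coordinate attention block of the kind referenced above is shown below; the reduction ratio, layer sizes, and pooling choice are illustrative assumptions rather than the authors' exact CAM configuration.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Minimal coordinate attention: pool along H and W separately, mix the two
    directional descriptors, and re-weight the input feature map."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(hidden)
        self.act = nn.ReLU(inplace=True)
        # Separate 1x1 convolutions produce per-direction attention maps.
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                      # (N, C, H, 1): pool over width
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (N, C, W, 1): pool over height
        y = torch.cat([x_h, x_w], dim=2)                       # (N, C, H+W, 1)
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        return x * a_h * a_w

# Example: re-weight a CT feature map before fusing it with shallower features.
feat = torch.randn(2, 64, 32, 32)
print(CoordinateAttention(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```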
Subjects
Deep Learning; Tomography, X-Ray Computed; Urinary Calculi; Humans; Image Processing, Computer-Assisted/methods; Algorithms
ABSTRACT
Auditory Attention Detection (AAD) aims to detect the target speaker from brain signals in a multi-speaker environment. Although EEG-based AAD methods have shown promising results in recent years, current approaches primarily rely on traditional convolutional neural networks designed for processing Euclidean data such as images, which makes it challenging to handle EEG signals with their non-Euclidean characteristics. To address this problem, this paper proposes a dynamical graph self-distillation (DGSD) approach for AAD that does not require speech stimuli as input. Specifically, to effectively represent the non-Euclidean properties of EEG signals, dynamical graph convolutional networks are applied to represent the graph structure of EEG signals and to extract crucial features related to auditory spatial attention. In addition, to further improve detection performance, self-distillation, consisting of feature distillation and hierarchical distillation strategies at each layer, is integrated. These strategies leverage the features and classification results from the deepest network layers to guide the learning of the shallow layers. Our experiments are conducted on two publicly available datasets, KUL and DTU. Under a 1-second time window, we achieve accuracies of 90.0% and 79.6% on KUL and DTU, respectively. We compare our DGSD method with competitive baselines, and the experimental results indicate that it not only achieves detection performance superior to the best reproducible baseline but also reduces the number of trainable parameters by a factor of approximately 100.
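As an illustration of the deepest-layer-guides-shallow-layers idea described above, the following is a hedged PyTorch sketch of a combined feature-level and logit-level self-distillation loss; the loss weights, temperature, and the assumption that shallow features have already been projected to the deepest feature's dimensionality are ours, not the exact DGSD formulation.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(shallow_logits, shallow_feats, deep_logits, deep_feat,
                           labels, temperature=3.0, alpha=0.5, beta=0.1):
    """Deepest-layer outputs supervise the shallow exits.
    shallow_logits / shallow_feats: lists with one entry per shallow layer."""
    loss = F.cross_entropy(deep_logits, labels)                     # hard-label loss, deepest exit
    soft_target = F.softmax(deep_logits.detach() / temperature, dim=1)
    for logits, feat in zip(shallow_logits, shallow_feats):
        loss = loss + F.cross_entropy(logits, labels)               # hard-label loss, shallow exit
        # Hierarchical distillation: KL between shallow and deepest predictions.
        kl = F.kl_div(F.log_softmax(logits / temperature, dim=1),
                      soft_target, reduction="batchmean") * temperature ** 2
        # Feature distillation: match shallow features to the deepest feature.
        fd = F.mse_loss(feat, deep_feat.detach())
        loss = loss + alpha * kl + beta * fd
    return loss
```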
Subjects
Attention; Electroencephalography; Neural Networks, Computer; Electroencephalography/methods; Humans; Attention/physiology; Auditory Perception/physiology; Brain/physiology; Acoustic Stimulation/methods; Algorithms
ABSTRACT
Self-supervised monocular depth estimation can exhibit excellent performance in static environments due to the multi-view consistency assumption during training. However, it is hard to maintain depth consistency in dynamic scenes when considering the occlusion caused by moving objects. For this reason, we propose a self-supervised self-distillation method for monocular depth estimation (SS-MDE) in dynamic scenes, in which a deep network with a multi-scale decoder and a lightweight pose network are designed to predict depth in a self-supervised manner via the disparity, motion information, and the association between two adjacent frames in the image sequence. Meanwhile, to improve the depth estimation accuracy of static areas, pseudo-depth images generated by the LeReS network provide pseudo-supervision, enhancing depth refinement in static areas. Furthermore, a forgetting factor is leveraged to alleviate the dependency on this pseudo-supervision. In addition, a teacher model is introduced to generate depth prior information, and a multi-view mask filter module is designed to implement feature extraction and noise filtering. This enables the student model to better learn the deep structure of dynamic scenes, enhancing the generalization and robustness of the entire model in a self-distillation manner. Finally, on four public datasets, the proposed SS-MDE method outperformed several state-of-the-art monocular depth estimation techniques, achieving an accuracy (δ1) of 89% with an error (AbsRel) of 0.102 on NYU-Depth V2 and an accuracy (δ1) of 87% with an error (AbsRel) of 0.111 on KITTI.
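A minimal sketch of how a forgetting factor might down-weight the pseudo-depth term over training is given below; the exponential schedule, the L1 form of the pseudo-depth loss, and the variable names are assumptions rather than the paper's exact definitions.

```python
import torch

def total_depth_loss(self_sup_loss: torch.Tensor,
                     pred_depth: torch.Tensor,
                     pseudo_depth: torch.Tensor,
                     epoch: int,
                     gamma: float = 0.95) -> torch.Tensor:
    """Combine the self-supervised loss with pseudo-depth supervision whose
    weight decays by a forgetting factor as training progresses."""
    forget = gamma ** epoch                                     # assumed exponential decay
    pseudo_loss = torch.abs(pred_depth - pseudo_depth).mean()   # L1 to LeReS pseudo-depth
    return self_sup_loss + forget * pseudo_loss
```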
ABSTRACT
Occluded person re-identification (Re-ID) is a challenging task, as pedestrians are often obstructed by various occlusions, such as non-pedestrian objects or non-target pedestrians. Previous methods have relied heavily on auxiliary models, such as human pose estimation, to obtain information from unoccluded regions. However, these auxiliary models fall short in accounting for pedestrian occlusions, leading to potential misrepresentations. In addition, some previous works learned feature representations from single images, ignoring the potential relations among samples. To address these issues, this paper introduces a Multi-Level Relation-Aware Transformer (MLRAT) model for occluded person Re-ID. The model mainly encompasses two novel modules: Patch-Level Relation-Aware (PLRA) and Sample-Level Relation-Aware (SLRA). PLRA learns fine-grained local features by modeling the structural relations between key patches, bypassing the dependency on auxiliary models. It adopts a model-free method to select key patches that have high semantic correlation with the final pedestrian representation. In particular, to alleviate the interference of occlusion, PLRA captures the structural relations among key patches via a two-layer Graph Convolutional Network (GCN), effectively guiding local feature fusion and learning. SLRA is designed to help the model learn discriminative features by modeling the relations among samples. Specifically, to mitigate noisy relations from irrelevant samples, we present a Relation-Aware Transformer (RAT) block to capture the relations among neighbors. Furthermore, to bridge the gap between the training and testing phases, a self-distillation method is employed to transfer the sample-level relations captured by SLRA to the backbone. Extensive experiments are conducted on four occluded datasets, two partial datasets, and two holistic datasets. The results show that the proposed MLRAT model significantly outperforms existing baselines on the four occluded datasets, while maintaining top performance on the two partial and two holistic datasets.
Subjects
Neural Networks, Computer; Pedestrians; Humans; Algorithms
ABSTRACT
In the field of skin lesion image segmentation, accurate identification and partitioning of diseased regions is of vital importance for in-depth analysis of skin cancer. Self-supervised learning, e.g., MAE, has emerged as a potent force in the medical imaging domain: it autonomously learns and extracts latent features from unlabeled data, yielding pre-trained models that greatly assist downstream tasks. To encourage pre-trained models to more comprehensively learn the global structural and local detail information inherent in dermoscopy images, we introduce a Teacher-Student architecture, named TEDMAE, which incorporates a self-distillation mechanism to learn holistic image feature information and improve the generalizable global knowledge learned by the student MAE model. To make the image features learned by the model suitable for unknown test images, two optimization strategies are proposed to enhance the generalizability of the global features learned by the teacher model and thereby improve the overall generalization capability of the pre-trained models: Exterior Conversion Augmentation (EC), which utilizes random convolutional kernels and linear interpolation to transform the input image into one with the same shape but altered intensities and textures, and Dynamic Feature Generation (DF), which employs a nonlinear attention mechanism for feature merging, enhancing the expressive power of the features. Experimental results on three public skin disease datasets, ISIC2019, ISIC2017, and PH2, indicate that our proposed TEDMAE method outperforms several similar approaches. Specifically, TEDMAE demonstrated optimal segmentation and generalization performance on the ISIC2017 and PH2 datasets, with Dice scores reaching 82.1% and 91.2%, respectively. The best Jaccard values were 72.6% and 84.5%, and the optimal HD95 values were 13.0% and 8.9%, respectively.
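The EC strategy described above can be sketched as a random-convolution perturbation blended back into the original image; the following PyTorch snippet is a hedged illustration in which the kernel-size range and the uniform blend coefficient are assumptions, not the exact TEDMAE implementation.

```python
import torch
import torch.nn.functional as F

def exterior_conversion(img: torch.Tensor, max_kernel: int = 5) -> torch.Tensor:
    """img: (N, C, H, W) in [0, 1]. Returns an image of identical shape with
    altered intensities and textures, linearly blended with the original."""
    c = img.shape[1]
    k = int(torch.randint(1, max_kernel // 2 + 1, (1,))) * 2 + 1        # random odd kernel size
    weight = torch.randn(c, c, k, k, device=img.device) / (c * k * k) ** 0.5
    converted = F.conv2d(img, weight, padding=k // 2)                   # random convolution
    converted = (converted - converted.amin()) / (converted.amax() - converted.amin() + 1e-6)
    alpha = torch.rand(1, device=img.device)                            # linear interpolation weight
    return alpha * img + (1 - alpha) * converted
```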
ABSTRACT
The objective of document-level relation extraction is to retrieve the relations that exist between entities within a document. Currently, deep learning methods demonstrate superior performance on document-level relation extraction tasks. However, to enhance performance, various methods directly introduce additional modules into the backbone model, which often increases the number of parameters in the overall model. Consequently, deploying these deep models in resource-limited environments presents a challenge. In this article, we introduce a self-distillation framework for document-level relation extraction. We partition the document-level relation extraction model into two distinct modules, namely the entity embedding representation module and the entity pair embedding representation module, and then apply separate distillation techniques to each module to reduce the model's size. To evaluate the proposed framework's performance, two benchmark datasets for document-level relation extraction, GDA and DocRED, are used in this study. The results demonstrate that our model effectively enhances performance while significantly reducing the model's size.
ABSTRACT
With the development of deep learning, sensors, and sensor collection methods, computer vision inspection technology has advanced rapidly. Deep-learning-based classification algorithms require a substantial quantity of training samples to obtain a model with good generalization capability. However, due to issues such as privacy, annotation costs, and the limitations of sensor-captured images, making full use of limited samples has become a major challenge for practical training and deployment. Furthermore, when models are trained in simulation and transferred to actual image scenarios, discrepancies often arise between the common training sets and the target domain (domain shift). Currently, meta-learning offers a promising solution to few-shot learning problems; however, the quantity of support-set data in the target domain remains limited, restricting cross-domain learning effectiveness. To address this challenge, we have developed a self-distillation and mixing (SDM) method based on a Teacher-Student framework. This method effectively transfers knowledge from the source domain to the target domain by applying self-distillation techniques and mixed data augmentation, learning better image representations from relatively abundant datasets, and fine-tuning in the target domain. In comparison with nine classical models, the experimental results demonstrate that the SDM method excels in terms of training time and accuracy. Furthermore, SDM effectively transfers knowledge from the source domain to the target domain even with a limited number of target-domain samples.
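One plausible reading of "self-distillation plus mixed data augmentation" is sketched below in PyTorch, combining standard mixup with a KL-based Teacher-Student term; the Beta parameter, temperature, loss weight, and the use of an EMA teacher are assumptions rather than the SDM paper's exact design.

```python
import torch
import torch.nn.functional as F

def mixup(x, y, alpha=0.4):
    """Standard mixup: convex combination of two samples; both labels are kept."""
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], y, y[idx], lam

def sdm_step(student, teacher, x, y, temperature=4.0, w_kd=0.5):
    x_mix, y_a, y_b, lam = mixup(x, y)
    s_logits = student(x_mix)
    with torch.no_grad():
        t_logits = teacher(x_mix)            # e.g. an EMA copy of the student
    ce = lam * F.cross_entropy(s_logits, y_a) + (1 - lam) * F.cross_entropy(s_logits, y_b)
    kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                  F.softmax(t_logits / temperature, dim=1),
                  reduction="batchmean") * temperature ** 2
    return ce + w_kd * kd
```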
ABSTRACT
Deep neural networks tend to suffer from overfitting when training data are insufficient. In this paper, we introduce two metrics based on the intra-class distributions of correctly predicted and incorrectly predicted samples to provide a new perspective on overfitting. Building on these metrics, we propose Tolerant Self-Distillation (TSD), a knowledge distillation approach for alleviating overfitting that does not require pretraining a teacher model in advance. It introduces an online updating memory that selectively stores the class predictions of samples from past iterations, making it possible to distill knowledge across iterations. Specifically, the class predictions stored in the memory bank serve as soft labels for supervising samples of the same class in the current iteration in a reverse way, i.e., correctly predicted samples are supervised with the stored incorrect predictions while incorrectly predicted samples are supervised with the stored correct predictions. Consequently, the premature convergence caused by over-confident samples is mitigated, which helps the model converge to a better local optimum. Extensive experimental results on several image classification benchmarks, including small-scale, large-scale, and fine-grained datasets, demonstrate the superiority of the proposed TSD.
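The memory-bank-with-reversed-targets mechanism described above might look roughly like the following PyTorch sketch; the EMA update rule, temperature, and loss weight are assumptions, and TSD's selective storage policy is simplified to a per-class running average.

```python
import torch
import torch.nn.functional as F

class TolerantSelfDistillation:
    """Sketch: keep per-class soft predictions of correct and incorrect samples
    from past iterations and use them as reversed soft labels."""
    def __init__(self, num_classes, momentum=0.9, temperature=3.0, weight=0.3):
        self.correct_mem = torch.full((num_classes, num_classes), 1.0 / num_classes)
        self.incorrect_mem = torch.full((num_classes, num_classes), 1.0 / num_classes)
        self.m, self.T, self.w = momentum, temperature, weight

    def __call__(self, logits, labels):
        self.correct_mem = self.correct_mem.to(logits.device)
        self.incorrect_mem = self.incorrect_mem.to(logits.device)
        probs = F.softmax(logits.detach() / self.T, dim=1)
        correct = logits.argmax(dim=1) == labels
        # Update the memory bank with the current iteration's predictions.
        for c in labels.unique():
            for mem, mask in ((self.correct_mem, correct), (self.incorrect_mem, ~correct)):
                sel = (labels == c) & mask
                if sel.any():
                    mem[c] = self.m * mem[c] + (1 - self.m) * probs[sel].mean(dim=0)
        # Reverse supervision: correct samples get stored incorrect predictions and vice versa.
        targets = torch.where(correct.unsqueeze(1),
                              self.incorrect_mem[labels],
                              self.correct_mem[labels])
        kd = F.kl_div(F.log_softmax(logits / self.T, dim=1), targets,
                      reduction="batchmean") * self.T ** 2
        return F.cross_entropy(logits, labels) + self.w * kd
```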
Subjects
Benchmarking; Knowledge; Neural Networks, Computer
ABSTRACT
Person re-identification technology has made significant progress in recent years with the development of deep learning. However, the recognition rate of models in this field is still lower than that of face recognition, making them challenging to deploy in practical application scenarios. Therefore, improving the recognition rate of pedestrian re-identification models remains a critical task. This paper focuses on three aspects of this problem. The first is to exploit the multi-branch network structure typical of person re-identification models to identify the most effective online self-distillation scheme between branches, without additional resource requirements, so that the information contained in each branch is fully used. Secondly, this paper analyzes and verifies, from theoretical and experimental perspectives, the pros and cons of knowledge distillation based on the mean squared error (MSE) loss function versus Kullback-Leibler (KL) divergence. Finally, we verify through experiments that adding a specific amount of noise perturbation to the model weights can further improve the recognition rate. After these improvements, we obtain state-of-the-art performance on four public person re-identification datasets.
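For reference, the two distillation losses being compared, and a simple weight-noise perturbation of the kind mentioned, can be sketched in PyTorch as follows; the temperature and noise scale are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def kd_mse(student_logits, teacher_logits):
    """Logit matching with MSE; no temperature is involved."""
    return F.mse_loss(student_logits, teacher_logits.detach())

def kd_kl(student_logits, teacher_logits, T=4.0):
    """Soft-label matching: KL divergence between temperature-softened distributions."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits.detach() / T, dim=1),
                    reduction="batchmean") * T ** 2

def perturb_weights(model, sigma=1e-3):
    """Add small Gaussian noise to every weight of the model."""
    with torch.no_grad():
        for p in model.parameters():
            p.add_(sigma * torch.randn_like(p))
```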
ABSTRACT
Rare diseases are characterized by low prevalence and are often chronically debilitating or life-threatening. Imaging phenotype classification of rare diseases is challenging due to the severe shortage of training examples. Few-shot learning (FSL) methods tackle this challenge by extracting generalizable prior knowledge from a large base dataset of common diseases and normal controls and transferring that knowledge to rare diseases. Yet, most existing methods require the base dataset to be labeled and do not make full use of the precious examples of rare diseases. In addition, the extremely small size of the training samples may result in inter-class performance imbalance due to insufficient sampling of the true distributions. To this end, we propose in this work a novel hybrid approach to rare disease imaging phenotype classification, featuring three key novelties targeted at these drawbacks. First, we adopt unsupervised representation learning (URL) based on a self-supervised contrastive loss, thereby eliminating the overhead of labeling the base dataset. Second, we integrate the URL with pseudo-label supervised classification for effective self-distillation of the knowledge about the rare diseases, composing a hybrid approach that takes advantage of both unsupervised and (pseudo-) supervised learning on the base dataset. Third, we use feature dispersion to assess the intra-class diversity of training samples and alleviate the inter-class performance imbalance via dispersion-aware correction. Experimental results on imaging phenotype classification of both simulated (skin lesions and cervical smears) and real clinical rare diseases (retinal diseases) show that our hybrid approach substantially outperforms existing FSL methods (including those using a fully supervised base dataset) via effective integration of the URL, pseudo-label driven self-distillation, and dispersion-aware imbalance correction, thus establishing a new state of the art.
Subjects
Rare Diseases; Retinal Diseases; Humans; Phenotype; Diagnostic Imaging
ABSTRACT
Early detection and accurate identification of thyroid nodules are major challenges in controlling and treating thyroid cancer, and can be difficult even for expert physicians. Many computer-aided diagnosis (CAD) systems have been developed to assist this clinical process. However, most of these systems cannot adequately capture geometrically diverse thyroid nodule representations from ultrasound images with subtle and varied characteristic differences, resulting in suboptimal diagnosis and a lack of clinical interpretability, which may affect their credibility in the clinic. In this context, a novel end-to-end network equipped with a deformable attention network and a distillation-driven interaction aggregation module (DIAM) is developed for thyroid nodule identification. The deformable attention network learns to identify discriminative features of nodules under the guidance of a deformable attention module (DAM) and an online class activation mapping (CAM) mechanism, and suggests the locations of diagnostic features to provide interpretable predictions. DIAM is designed to take advantage of the complementarity of adjacent layers, enhancing the representation capability of the aggregated features; driven by an efficient self-distillation mechanism, the identification process is complemented with multi-scale semantic information to calibrate the diagnosis results. Experimental results on a large dataset with varying nodule appearances show that the proposed network achieves competitive performance in nodule diagnosis and provides interpretability suitable for clinical needs.
Subjects
Thyroid Nodule; Humans; Thyroid Nodule/diagnostic imaging; Distillation; Diagnosis, Computer-Assisted/methods; Ultrasonography/methods
ABSTRACT
Single-cell spatial transcriptomics, such as in-situ hybridization or sequencing technologies, can provide subcellular resolution that enables the identification of individual cell identities and locations and a deep understanding of subcellular mechanisms. However, accurate segmentation and annotation that allow individual cell boundaries to be determined remain a major challenge that limits all of the above, as well as downstream insights. Current machine learning methods rely heavily on nuclei or cell body staining, resulting in significant loss of transcriptome depth and a limited ability to learn latent representations of spatial colocalization relationships. Here, we propose Bering, a graph deep learning model that leverages transcript colocalization relationships for joint noise-aware cell segmentation and molecular annotation in 2D and 3D spatial transcriptomics data. Graph embeddings for cell annotation are transferred as a component of multi-modal input for cell segmentation, which is employed to enrich gene relationships throughout the process. To evaluate performance, we benchmarked Bering against state-of-the-art methods and observed significant improvements in cell segmentation accuracy and the number of detected transcripts across various spatial technologies and tissues. To streamline segmentation, we constructed expansive pre-trained models, which yield high segmentation accuracy on new data through transfer learning and self-distillation, demonstrating the generalizability of Bering.
ABSTRACT
During percutaneous coronary intervention, the guiding catheter plays an important role. Tracking the catheter tip placed at the coronary ostium in an X-ray fluoroscopy sequence can provide image displacement information caused by the heart beating, which helps overlay a dynamic coronary roadmap on the X-ray fluoroscopy images. Due to the low exposure dose, X-ray fluoroscopy is noisy and has low contrast, which makes tracking difficult. In this paper, we develop a new catheter tip tracking framework. First, a lightweight, efficient catheter tip segmentation network is proposed and boosted by a self-distillation training mechanism. Then, a Bayesian filtering post-processing method is used to incorporate sequence information and refine the single-image segmentation results. By separating the segmentation results into several groups based on connectivity, our framework can track multiple catheter tips. The proposed tracking framework is validated on a clinical X-ray sequence dataset.
Subjects
Catheters; Image Processing, Computer-Assisted; Humans; X-Rays; Bayes Theorem; Image Processing, Computer-Assisted/methods; Fluoroscopy/methods
ABSTRACT
Self-distillation methods utilize a Kullback-Leibler (KL) divergence loss to transfer knowledge from the network itself, which can improve model performance without increasing computational resources and complexity. However, when applied to salient object detection (SOD), it is difficult to transfer knowledge effectively using KL. In order to improve SOD model performance without increasing computational resources, a non-negative feedback self-distillation method is proposed. Firstly, a virtual teacher self-distillation method is proposed to enhance model generalization; it achieves good results in pixel-wise classification tasks but brings less improvement in SOD. Secondly, to understand the behavior of the self-distillation loss, the gradient directions of the KL and cross-entropy (CE) losses are analyzed. It is found that in SOD, KL can produce gradients that are inconsistent with, and opposite in direction to, those of CE. Finally, a non-negative feedback loss is proposed for SOD, which calculates the distillation loss of the foreground and background in different ways to ensure that the teacher network transfers only positive knowledge to the student. Experiments on five datasets show that the proposed self-distillation methods effectively improve the performance of SOD models, with the average Fβ increased by about 2.7% compared with the baseline network.
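As an illustration of the virtual-teacher idea mentioned above (a hand-crafted target distribution standing in for a trained teacher), here is a hedged pixel-wise PyTorch sketch; the confidence value, temperature, and loss weight are assumptions, and the foreground/background split of the full non-negative feedback loss is not reproduced here.

```python
import torch
import torch.nn.functional as F

def virtual_teacher_loss(logits, labels, correct_prob=0.99, T=4.0, alpha=0.5):
    """Pixel-wise virtual-teacher distillation: the 'teacher' is a hand-crafted
    distribution putting correct_prob mass on the ground-truth class.
    logits: (N, K, H, W); labels: (N, H, W) integer class ids (K=2 for SOD)."""
    n, k, h, w = logits.shape
    target = torch.full_like(logits, (1.0 - correct_prob) / (k - 1))
    target.scatter_(1, labels.unsqueeze(1), correct_prob)          # virtual teacher distribution
    logp = F.log_softmax(logits / T, dim=1).permute(0, 2, 3, 1).reshape(-1, k)
    tgt = target.permute(0, 2, 3, 1).reshape(-1, k)
    kd = F.kl_div(logp, tgt, reduction="batchmean") * T ** 2
    return F.cross_entropy(logits, labels) + alpha * kd
```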
ABSTRACT
Due to its capacity to gather vast, high-level data about human activity from wearable or stationary sensors, human activity recognition substantially impacts people's day-to-day lives. Multiple people and objects may appear in a video, dispersed across the frame in various places. Because of this, modeling the interactions between many entities in spatial dimensions is necessary for visual reasoning in the action recognition task. The main aim of this paper is to evaluate and map the current scenario of human action recognition in red, green, and blue (RGB) videos based on deep learning models. A residual network (ResNet) and a vision transformer architecture (ViT) with a semi-supervised learning approach are evaluated. DINO (self-DIstillation with NO labels) is used to enhance the potential of the ResNet and ViT. The evaluated benchmark is the human motion database (HMDB51), which tries to better capture the richness and complexity of human actions. The obtained results for video classification with the proposed ViT are promising based on performance metrics and results from the recent literature. The results obtained using a two-dimensional ViT with long short-term memory demonstrated strong performance in human action recognition when applied to the HMDB51 dataset. This architecture achieved accuracies of 96.7 ± 0.35% and 41.0 ± 0.27% (mean ± standard deviation) in the training and test phases on the HMDB51 dataset, respectively.
Subjects
Deep Learning; Humans; Neural Networks, Computer; Supervised Machine Learning; Human Activities; Motion (Physics)
ABSTRACT
Most existing point cloud instance segmentation methods require accurate and dense point-level annotations, which are extremely laborious to collect. While incomplete and inexact supervision have been exploited to reduce labeling effort, inaccurate supervision remains under-explored. This kind of supervision is almost inevitable in practice, especially in complex 3D point clouds, and it severely degrades the generalization performance of deep networks. To this end, we propose the first weakly supervised point cloud instance segmentation framework with inaccurate box-level labels. A novel self-distillation architecture is presented to boost generalization ability while leveraging the cheap but noisy bounding-box annotations. Specifically, we employ consistency regularization to distill self-knowledge from data perturbation and historical predictions, which prevents the deep network from overfitting the noisy labels. Moreover, we progressively select reliable samples and correct their labels based on historical consistency. Extensive experiments on the ScanNet-v2 dataset validate the effectiveness and robustness of our method in dealing with inexact and inaccurate annotations.
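A hedged sketch of consistency regularization against historical predictions is given below: an exponential moving average of each point's past class probabilities serves both as a consistency target for perturbed inputs and as a reliability signal for label correction; the EMA momentum and confidence threshold are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

class HistoricalConsistency:
    """Sketch: keep an EMA of each point's past class probabilities, regularize
    current (perturbed) predictions toward it, and flag consistently confident points."""
    def __init__(self, num_points, num_classes, momentum=0.99, thresh=0.9):
        self.hist = torch.full((num_points, num_classes), 1.0 / num_classes)
        self.m, self.thresh = momentum, thresh

    def step(self, point_ids, logits_perturbed):
        self.hist = self.hist.to(logits_perturbed.device)
        logp = F.log_softmax(logits_perturbed, dim=1)
        hist = self.hist[point_ids]
        # Consistency loss between the perturbed prediction and the historical prediction.
        loss = F.kl_div(logp, hist, reduction="batchmean")
        # Update the history and mark points whose history is confidently consistent.
        self.hist[point_ids] = self.m * hist + (1 - self.m) * logp.exp().detach()
        reliable = self.hist[point_ids].max(dim=1).values > self.thresh
        return loss, reliable
```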
ABSTRACT
Vision transformers efficiently model long-range context and have thus demonstrated impressive accuracy gains in several image analysis tasks, including segmentation. However, such methods need large labeled datasets for training, which are hard to obtain for medical image analysis. Self-supervised learning (SSL) has demonstrated success in medical image segmentation using convolutional networks. In this work, we developed a self-distillation learning with masked image modeling method to perform SSL for vision transformers (SMIT), applied to 3D multi-organ segmentation from CT and MRI. Our contribution combines a dense pixel-wise regression pretext task performed within masked patches, called masked image prediction, with masked patch token distillation to pre-train vision transformers. Our approach is more accurate and requires fewer fine-tuning datasets than other pretext tasks. Unlike prior methods, which typically used image sets arising from the disease sites and imaging modalities of the target tasks, we used 3,643 CT scans (602,708 images) from head and neck, lung, and kidney cancers as well as COVID-19 for pre-training, and applied the pre-trained model to abdominal organ segmentation from MRI scans of pancreatic cancer patients as well as to publicly available segmentation of 13 different abdominal organs from CT. Our method showed clear accuracy improvements (average DSC of 0.875 on MRI and 0.878 on CT) with a reduced requirement for fine-tuning datasets compared with commonly used pretext tasks. Extensive comparisons against multiple current SSL methods were performed. Our code is available at: https://github.com/harveerar/SMIT.git.
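The masked image prediction pretext task can be sketched in 2D as random patch masking followed by a reconstruction loss restricted to the masked patches; the patch size, mask ratio, and L1 loss below are assumptions, and the companion masked patch token distillation term is omitted.

```python
import torch

def mask_patches(images, patch_size=16, mask_ratio=0.7):
    """Randomly mask a fraction of non-overlapping patches; return the masked
    images and a boolean pixel mask marking the masked regions."""
    n, c, h, w = images.shape
    mask = torch.rand(n, h // patch_size, w // patch_size, device=images.device) < mask_ratio
    pixel_mask = mask.repeat_interleave(patch_size, 1).repeat_interleave(patch_size, 2)
    masked = images * (~pixel_mask).unsqueeze(1)                 # zero out masked patches
    return masked, pixel_mask

def masked_image_prediction_loss(student, images):
    """Dense pixel regression restricted to the masked patches."""
    masked, pixel_mask = mask_patches(images)
    recon = student(masked)                                      # decoder returns an image-shaped tensor
    diff = (recon - images).abs()                                # L1 reconstruction error
    return (diff * pixel_mask.unsqueeze(1)).sum() / (pixel_mask.sum() * images.shape[1] + 1e-6)
```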
ABSTRACT
Current deep-learning-based cervical cell classification methods suffer from parameter redundancy and poor generalization performance, which creates challenges for the intelligent classification of cervical cytology smear images. In this paper, we establish a classification method that combines transfer learning and knowledge distillation. This method not only transfers common features between different source-domain data, but also realizes model-to-model knowledge transfer using the unnormalized probability outputs between models as knowledge. A multi-exit classification network is then introduced as the student network, with a global context module embedded in each exit branch. A self-distillation method is proposed to fuse contextual information: deep classifiers in the student network guide shallow classifiers to learn, and the multiple classifier outputs are fused using an average-integration strategy to form a classifier with strong generalization performance. The experimental results show that the developed method achieves good results on the SIPaKMeD dataset: the accuracy, sensitivity, specificity, and F-measure of the five-class classification are 98.52%, 98.53%, 98.68%, and 98.59%, respectively. The effectiveness of the method is further verified on a natural image dataset.
ABSTRACT
The rapid development of spatial transcriptomics allows the measurement of RNA abundance at high spatial resolution, making it possible to simultaneously profile gene expression, the spatial locations of cells or spots, and the corresponding hematoxylin and eosin-stained histology images. It is therefore promising to predict gene expression from histology images, which are relatively easy and cheap to obtain. Several methods have been devised for this purpose, but they have not fully captured the internal relations of the 2D vision features or the spatial dependency between spots. Here, we developed Hist2ST, a deep learning-based model to predict RNA-seq expression from histology images. Around each sequenced spot, the corresponding histology image is cropped into an image patch and fed into a convolutional module to extract 2D vision features. Meanwhile, the spatial relations with the whole image and neighboring patches are captured through Transformer and graph neural network modules, respectively. These learned features are then used to predict gene expression by following the zero-inflated negative binomial (ZINB) distribution. To alleviate the impact of the small size of spatial transcriptomics data, a self-distillation mechanism is employed for efficient learning of the model. Comprehensive tests on cancer and normal datasets showed that Hist2ST outperforms existing methods in terms of both gene expression prediction and spatial region identification. Further pathway analyses indicated that our model preserves biological information. Thus, Hist2ST enables generating spatial transcriptomics data from histology images for elucidating the molecular signatures of tissues.
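For readers unfamiliar with the ZINB objective, a minimal PyTorch sketch of its negative log-likelihood is given below, using the common (mean, inverse dispersion, dropout probability) parameterization; this is a generic formulation and is not claimed to match Hist2ST's exact prediction head.

```python
import torch

def zinb_nll(y, mu, theta, pi, eps=1e-8):
    """Negative log-likelihood of counts y (float tensor) under a zero-inflated
    negative binomial with mean mu > 0, inverse dispersion theta > 0, and
    zero-inflation (dropout) probability pi in (0, 1)."""
    log_theta_mu = torch.log(theta + mu + eps)
    log_nb = (torch.lgamma(y + theta) - torch.lgamma(theta) - torch.lgamma(y + 1)
              + theta * (torch.log(theta + eps) - log_theta_mu)
              + y * (torch.log(mu + eps) - log_theta_mu))
    nb_zero = torch.exp(theta * (torch.log(theta + eps) - log_theta_mu))  # P(y=0) under the NB part
    log_lik_zero = torch.log(pi + (1.0 - pi) * nb_zero + eps)             # zero counts
    log_lik_nonzero = torch.log(1.0 - pi + eps) + log_nb                  # positive counts
    return -torch.where(y < 0.5, log_lik_zero, log_lik_nonzero).mean()
```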
Subjects
Image Processing, Computer-Assisted; Transcriptome; Eosine Yellowish-(YS); Hematoxylin; Image Processing, Computer-Assisted/methods; Neural Networks, Computer; RNA
ABSTRACT
Convolutional neural networks (CNNs) and their variants have been widely used to develop histopathological image-based computer-aided diagnosis (CAD). However, annotated data are scarce in clinical practice, and limited training samples generally cannot train a CNN model well, resulting in degraded predictive performance. To this end, we propose a novel Self-Distilled Supervised Contrastive Learning (SDSCL) algorithm to improve the diagnostic performance of a CNN-based CAD for breast cancer. In particular, the original histopathological images are first decomposed into H and E stain views, which serve as the augmented sample pairs in supervised contrastive learning (SCL). Due to the complementary characteristics of the two stain views, this data-driven SCL guides the CNN model to efficiently learn discriminative features, alleviating the problem of small sample size. Furthermore, self-distillation is embedded into the SCL framework, in which the CNN model jointly distills itself and conducts SCL to further improve feature representation. The proposed SDSCL is evaluated on two public breast histopathological datasets and outperforms all compared algorithms. Its average classification accuracy, precision, recall, and F1 scores are 94.28%, 94.64%, 94.58%, and 94.34%, respectively, on the Bioimaging dataset, and 80.44%, 81.92%, 80.57%, and 80.10% on the Databiox dataset. The experimental results on the two datasets indicate that SDSCL has potential for histopathological image-based CAD.
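Decomposing an H&E image into H-only and E-only views of the kind used as augmented pairs above can be done with standard color deconvolution; the following sketch uses scikit-image's rgb2hed/hed2rgb and illustrates only the view construction, not SDSCL's full training pipeline.

```python
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def stain_views(rgb_image: np.ndarray):
    """Split an H&E histopathology image (H, W, 3, RGB) into an H-only view and
    an E-only view; the two views of one image form a positive pair for SCL."""
    hed = rgb2hed(rgb_image)                       # color deconvolution to H, E, DAB channels
    null = np.zeros_like(hed[..., 0])
    h_view = hed2rgb(np.stack((hed[..., 0], null, null), axis=-1))  # hematoxylin only
    e_view = hed2rgb(np.stack((null, hed[..., 1], null), axis=-1))  # eosin only
    return h_view, e_view
```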