RESUMO
Magnetic Resonance Spectroscopic Imaging (MRSI) is a non-invasive imaging technique for studying metabolism and has become a crucial tool for understanding neurological diseases, cancers and diabetes. High spatial resolution MRSI is needed to characterize lesions, but in practice MRSI is acquired at low resolution due to time and sensitivity restrictions caused by the low metabolite concentrations. Therefore, there is an imperative need for a post-processing approach to generate high-resolution MRSI from low-resolution data that can be acquired fast and with high sensitivity. Deep learning-based super-resolution methods provided promising results for improving the spatial resolution of MRSI, but they still have limited capability to generate accurate and high-quality images. Recently, diffusion models have demonstrated superior learning capability than other generative models in various tasks, but sampling from diffusion models requires iterating through a large number of diffusion steps, which is time-consuming. This work introduces a Flow-based Truncated Denoising Diffusion Model (FTDDM) for super-resolution MRSI, which shortens the diffusion process by truncating the diffusion chain, and the truncated steps are estimated using a normalizing flow-based network. The network is conditioned on upscaling factors to enable multi-scale super-resolution. To train and evaluate the deep learning models, we developed a 1H-MRSI dataset acquired from 25 high-grade glioma patients. We demonstrate that FTDDM outperforms existing generative models while speeding up the sampling process by over 9-fold compared to the baseline diffusion model. Neuroradiologists' evaluations confirmed the clinical advantages of our method, which also supports uncertainty estimation and sharpness adjustment, extending its potential clinical applications.
RESUMO
Myeloablative pre-conditioning facilitates the differentiation of transplanted hematopoietic stem and progenitor cells (HSPCs). However, the factors in the stress environment that regulate HSPC behavior remain elusive. Here, we investigated the mechanisms that shaped the cell fates of transplanted murine multipotent progenitors (MPPs) expressing the Fms-related receptor tyrosine kinase 3 gene (Flt3). Using lineage tracing, clonal analysis, and single-cell RNA sequencing (RNA-seq), we showed that the myeloablative environment increased lymphoid priming of Flt3+ MPPs and that their efficient B cell output required intact interleukin 1 (IL-1) signaling. The Flt3+ MPPs with short-term exposure to IL-1ß underwent a myeloid-biased to lymphoid-biased cell fate switch and produced more lymphoid-biased progeny with a stronger B lymphopoiesis capacity in vitro. Correspondingly, a brief exposure to IL-1ß facilitated the B cell output of transplanted Flt3+ MPPs in vivo. Together, our study demonstrated an unrecognized function of IL-1ß in promoting B lymphopoiesis and highlighted a latent effect of IL-1ß in regulating MPP cell fate dynamics.
RESUMO
Recent studies on contrastive learning have achieved remarkable performance solely by leveraging few labels in the context of medical image segmentation. Existing methods mainly focus on instance discrimination and invariant mapping (i.e., pulling positive samples closer and negative samples apart in the feature space). However, they face three common pitfalls: (1) tailness: medical image data usually follows an implicit long-tail class distribution. Blindly leveraging all pixels in training hence can lead to the data imbalance issues, and cause deteriorated performance; (2) consistency: it remains unclear whether a segmentation model has learned meaningful and yet consistent anatomical features due to the intra-class variations between different anatomical features; and (3) diversity: the intra-slice correlations within the entire dataset have received significantly less attention. This motivates us to seek a principled approach for strategically making use of the dataset itself to discover similar yet distinct samples from different anatomical views. In this paper, we introduce a novel semi-supervised 2D medical image segmentation framework termed Mine yOur owNAnatomy (MONA), and make three contributions. First, prior work argues that every pixel equally matters to the model training; we observe empirically that this alone is unlikely to define meaningful anatomical features, mainly due to lacking the supervision signal. We show two simple solutions towards learning invariances-through the use of stronger data augmentations and nearest neighbors. Second, we construct a set of objectives that encourage the model to be capable of decomposing medical images into a collection of anatomical features in an unsupervised manner. Lastly, we both empirically and theoretically, demonstrate the efficacy of our MONA on three benchmark datasets with different labeled settings, achieving new state-of-the-art under different labeled semi-supervised settings. MONA makes minimal assumptions on domain expertise, and hence constitutes a practical and versatile solution in medical image analysis. We provide the PyTorch-like pseudo-code in supplementary.
RESUMO
Image registration is an essential step in many medical image analysis tasks. Traditional methods for image registration are primarily optimization-driven, finding the optimal deformations that maximize the similarity between two images. Recent learning-based methods, trained to directly predict transformations between two images, run much faster, but suffer from performance deficiencies due to domain shift. Here we present a new neural network based image registration framework, called NIR (Neural Image Registration), which is based on optimization but utilizes deep neural networks to model deformations between image pairs. NIR represents the transformation between two images with a continuous function implemented via neural fields, receiving a 3D coordinate as input and outputting the corresponding deformation vector. NIR provides two ways of generating deformation field: directly output a displacement vector field for general deformable registration, or output a velocity vector field and integrate the velocity field to derive the deformation field for diffeomorphic image registration. The optimal registration is discovered by updating the parameters of the neural field via stochastic mini-batch gradient descent. We describe several design choices that facilitate model optimization, including coordinate encoding, sinusoidal activation, coordinate sampling, and intensity sampling. NIR is evaluated on two 3D MR brain scan datasets, demonstrating highly competitive performance in terms of both registration accuracy and regularity. Compared to traditional optimization-based methods, our approach achieves better results in shorter computation times. In addition, our methods exhibit performance on a cross-dataset registration task, compared to the pre-trained learning-based methods.
Assuntos
Imageamento Tridimensional , Imageamento por Ressonância Magnética , Redes Neurais de Computação , Humanos , Imageamento por Ressonância Magnética/métodos , Imageamento Tridimensional/métodos , Encéfalo/diagnóstico por imagem , Algoritmos , Processamento de Imagem Assistida por Computador/métodosRESUMO
Medical image segmentation is a critical component in clinical practice, facilitating accurate diagnosis, treatment planning, and disease monitoring. However, existing methods, often tailored to specific modalities or disease types, lack generalizability across the diverse spectrum of medical image segmentation tasks. Here we present MedSAM, a foundation model designed for bridging this gap by enabling universal medical image segmentation. The model is developed on a large-scale medical image dataset with 1,570,263 image-mask pairs, covering 10 imaging modalities and over 30 cancer types. We conduct a comprehensive evaluation on 86 internal validation tasks and 60 external validation tasks, demonstrating better accuracy and robustness than modality-wise specialist models. By delivering accurate and efficient segmentation across a wide spectrum of tasks, MedSAM holds significant potential to expedite the evolution of diagnostic tools and the personalization of treatment plans.
Assuntos
Diagnóstico por Imagem , Processamento de Imagem Assistida por Computador , Processamento de Imagem Assistida por Computador/métodosRESUMO
Deep neural networks have been integrated into the whole clinical decision procedure which can improve the efficiency of diagnosis and alleviate the heavy workload of physicians. Since most neural networks are supervised, their performance heavily depends on the volume and quality of available labels. However, few such labels exist for rare diseases (e.g., new pandemics). Here we report a medical multimodal large language model (Med-MLLM) for radiograph representation learning, which can learn broad medical knowledge (e.g., image understanding, text semantics, and clinical phenotypes) from unlabelled data. As a result, when encountering a rare disease, our Med-MLLM can be rapidly deployed and easily adapted to them with limited labels. Furthermore, our model supports medical data across visual modality (e.g., chest X-ray and CT) and textual modality (e.g., medical report and free-text clinical note); therefore, it can be used for clinical tasks that involve both visual and textual data. We demonstrate the effectiveness of our Med-MLLM by showing how it would perform using the COVID-19 pandemic "in replay". In the retrospective setting, we test the model on the early COVID-19 datasets; and in the prospective setting, we test the model on the new variant COVID-19-Omicron. The experiments are conducted on 1) three kinds of input data; 2) three kinds of downstream tasks, including disease reporting, diagnosis, and prognosis; 3) five COVID-19 datasets; and 4) three different languages, including English, Chinese, and Spanish. All experiments show that our model can make accurate and robust COVID-19 decision-support with little labelled data.
RESUMO
Objective.Metal artifact reduction (MAR) has been a key issue in CT imaging. Recently, MAR methods based on deep learning have achieved promising results. However, when deploying deep learning-based MAR in real-world clinical scenarios, two prominent challenges arise. One limitation is the lack of paired training data in real applications, which limits the practicality of supervised methods. Another limitation is that image-domain methods suitable for more application scenarios are inadequate in performance while end-to-end approaches with better performance are only applicable to fan-beam CT due to large memory consumption.Approach.We propose a novel image-domain MAR method based on the generative adversarial network with variable constraints (MARGANVAC) to improve MAR performance. The proposed variable constraint is a kind of time-varying cost function that can relax the fidelity constraint at the beginning and gradually strengthen the fidelity constraint as the training progresses. To better deploy our image-domain supervised method into practical scenarios, we develop a transfer method to mimic the real metal artifacts by first extracting the real metal traces and then adding them to artifact-free images to generate paired training data.Main results.The effectiveness of the proposed method is validated in simulated fan-beam experiments and real cone-beam experiments. All quantitative and qualitative results demonstrate that the proposed method achieves superior performance compared with the competing methods.Significance.The MARGANVAC model proposed in this paper is an image-domain model that can be conveniently applied to various scenarios such as fan beam and cone beam CT. At the same time, its performance is on par with the cutting-edge dual-domain MAR approaches. In addition, the metal artifact transfer method proposed in this paper can easily generate paired data with real artifact features, which can be better used for model training in real scenarios.
Assuntos
Artefatos , Tomografia Computadorizada por Raios X , Tomografia Computadorizada por Raios X/métodos , Tomografia Computadorizada de Feixe Cônico , Algoritmos , Metais , Processamento de Imagem Assistida por Computador/métodos , Imagens de FantasmasRESUMO
Contrastive learning has shown great promise over annotation scarcity problems in the context of medical image segmentation. Existing approaches typically assume a balanced class distribution for both labeled and unlabeled medical images. However, medical image data in reality is commonly imbalanced (i.e., multi-class label imbalance), which naturally yields blurry contours and usually incorrectly labels rare objects. Moreover, it remains unclear whether all negative samples are equally negative. In this work, we present ACTION, an Anatomical-aware ConTrastive dIstillatiON framework, for semi-supervised medical image segmentation. Specifically, we first develop an iterative contrastive distillation algorithm by softly labeling the negatives rather than binary supervision between positive and negative pairs. We also capture more semantically similar features from the randomly chosen negative set compared to the positives to enforce the diversity of the sampled data. Second, we raise a more important question: Can we really handle imbalanced samples to yield better performance? Hence, the key innovation in ACTION is to learn global semantic relationship across the entire dataset and local anatomical features among the neighbouring pixels with minimal additional memory footprint. During the training, we introduce anatomical contrast by actively sampling a sparse set of hard negative pixels, which can generate smoother segmentation boundaries and more accurate predictions. Extensive experiments across two benchmark datasets and different unlabeled settings show that ACTION significantly outperforms the current state-of-the-art semi-supervised methods.
RESUMO
OBJECTIVES: The aim of this study was to evaluate the severity of COVID-19 patients' disease by comparing a multiclass lung lesion model to a single-class lung lesion model and radiologists' assessments in chest computed tomography scans. MATERIALS AND METHODS: The proposed method, AssessNet-19, was developed in 2 stages in this retrospective study. Four COVID-19-induced tissue lesions were manually segmented to train a 2D-U-Net network for a multiclass segmentation task followed by extensive extraction of radiomic features from the lung lesions. LASSO regression was used to reduce the feature set, and the XGBoost algorithm was trained to classify disease severity based on the World Health Organization Clinical Progression Scale. The model was evaluated using 2 multicenter cohorts: a development cohort of 145 COVID-19-positive patients from 3 centers to train and test the severity prediction model using manually segmented lung lesions. In addition, an evaluation set of 90 COVID-19-positive patients was collected from 2 centers to evaluate AssessNet-19 in a fully automated fashion. RESULTS: AssessNet-19 achieved an F1-score of 0.76 ± 0.02 for severity classification in the evaluation set, which was superior to the 3 expert thoracic radiologists (F1 = 0.63 ± 0.02) and the single-class lesion segmentation model (F1 = 0.64 ± 0.02). In addition, AssessNet-19 automated multiclass lesion segmentation obtained a mean Dice score of 0.70 for ground-glass opacity, 0.68 for consolidation, 0.65 for pleural effusion, and 0.30 for band-like structures compared with ground truth. Moreover, it achieved a high agreement with radiologists for quantifying disease extent with Cohen κ of 0.94, 0.92, and 0.95. CONCLUSIONS: A novel artificial intelligence multiclass radiomics model including 4 lung lesions to assess disease severity based on the World Health Organization Clinical Progression Scale more accurately determines the severity of COVID-19 patients than a single-class model and radiologists' assessment.
Assuntos
COVID-19 , Humanos , Inteligência Artificial , Estudos Retrospectivos , Pulmão/diagnóstico por imagem , Tomografia Computadorizada por Raios X/métodos , Progressão da DoençaRESUMO
BACKGROUND: Many dedicated cone-beam CT (CBCT) systems have irregular scanning trajectories. Compared with the standard CBCT calibration, accurate calibration for CBCT systems with irregular trajectories is a more complex task, since the geometric parameters for each scanning view are variable. Most of the existing calibration methods assume that the intrinsic geometric relationship of the fiducials in the phantom is precisely known, and rarely delve deeper into the issue of whether the phantom accuracy is adapted to the calibration model. PURPOSE: A high-precision phantom and a highly robust calibration model are interdependent and mutually supportive, and they are both important for calibration accuracy, especially for the high-resolution CBCT. Therefore, we propose a calibration scheme that considers both accurate phantom measurement and robust geometric calibration. METHODS: Our proposed scheme consists of two parts: (1) introducing a measurement model to acquire the accurate intrinsic geometric relationship of the fiducials in the phantom; (2) developing a highly noise-robust nonconvex model-based calibration method. The measurement model in the first part is achieved by extending our previous high-precision geometric calibration model suitable for CBCT with circular trajectories. In the second part, a novel iterative method with optimization constraints based on a back-projection model is developed to solve the geometric parameters of each view. RESULTS: The simulations and real experiments show that the measurement errors of the fiducial ball bearings (BBs) are within the subpixel level. With the help of the geometric relationship of the BBs obtained by our measurement method, the classic calibration method can achieve good calibration based on far fewer BBs. All metrics obtained in simulated experiments as well as in real experiments on Micro CT systems with resolutions of 9 and 4.5 µm show that the proposed calibration method has higher calibration accuracy than the competing classic method. It is particularly worth noting that although our measurement model proves to be very accurate, the classic calibration method based on this measurement model can only achieve good calibration results when the resolution of the measurement system is close to that of the system to be calibrated, but our calibration scheme enables high-accuracy calibration even when the resolution of the system to be calibrated is twice that of the measurement system. CONCLUSIONS: The proposed combined geometrical calibration scheme does not rely on a phantom with an intricate pattern of fiducials, so it is applicable in Micro CT with high resolution. The two parts of the scheme, the "measurement model" and the "calibration model," prove to be of high accuracy. The combination of these two models can effectively improve the calibration accuracy, especially in some extreme cases.
Assuntos
Algoritmos , Processamento de Imagem Assistida por Computador , Humanos , Calibragem , Processamento de Imagem Assistida por Computador/métodos , Tomografia Computadorizada de Feixe Cônico/métodos , Microtomografia por Raio-X , Imagens de FantasmasRESUMO
Head movement during long scan sessions degrades the quality of reconstruction in positron emission tomography (PET) and introduces artifacts, which limits clinical diagnosis and treatment. Recent deep learning-based motion correction work utilized raw PET list-mode data and hardware motion tracking (HMT) to learn head motion in a supervised manner. However, motion prediction results were not robust to testing subjects outside the training data domain. In this paper, we integrate a cross-attention mechanism into the supervised deep learning network to improve motion correction across test subjects. Specifically, cross-attention learns the spatial correspondence between the reference images and moving images to explicitly focus the model on the most correlative inherent information - the head region the motion correction. We validate our approach on brain PET data from two different scanners: HRRT without time of flight (ToF) and mCT with ToF. Compared with traditional and deep learning benchmarks, our network improved the performance of motion correction by 58% and 26% in translation and rotation, respectively, in multi-subject testing in HRRT studies. In mCT studies, our approach improved performance by 66% and 64% for translation and rotation, respectively. Our results demonstrate that cross-attention has the potential to improve the quality of brain PET image reconstruction without the dependence on HMT. All code will be released on GitHub: https://github.com/OnofreyLab/dl_hmc_attention_mlcn2023.
RESUMO
Head motion correction is an essential component of brain PET imaging, in which even motion of small magnitude can greatly degrade image quality and introduce artifacts. Building upon previous work, we propose a new head motion correction framework taking fast reconstructions as input. The main characteristics of the proposed method are: (i) the adoption of a high-resolution short-frame fast reconstruction workflow; (ii) the development of a novel encoder for PET data representation extraction; and (iii) the implementation of data augmentation techniques. Ablation studies are conducted to assess the individual contributions of each of these design choices. Furthermore, multi-subject studies are conducted on an 18F-FPEB dataset, and the method performance is qualitatively and quantitatively evaluated by MOLAR reconstruction study and corresponding brain Region of Interest (ROI) Standard Uptake Values (SUV) evaluation. Additionally, we also compared our method with a conventional intensity-based registration method. Our results demonstrate that the proposed method outperforms other methods on all subjects, and can accurately estimate motion for subjects out of the training set. All code is publicly available on GitHub: https://github.com/OnofreyLab/dl-hmc_fast_recon_miccai2023.
RESUMO
Integrating high-level semantically correlated contents and low-level anatomical features is of central importance in medical image segmentation. Towards this end, recent deep learning-based medical segmentation methods have shown great promise in better modeling such information. However, convolution operators for medical segmentation typically operate on regular grids, which inherently blur the high-frequency regions, i.e., boundary regions. In this work, we propose MORSE, a generic implicit neural rendering framework designed at an anatomical level to assist learning in medical image segmentation. Our method is motivated by the fact that implicit neural representation has been shown to be more effective in fitting complex signals and solving computer graphics problems than discrete grid-based representation. The core of our approach is to formulate medical image segmentation as a rendering problem in an end-to-end manner. Specifically, we continuously align the coarse segmentation prediction with the ambiguous coordinate-based point representations and aggregate these features to adaptively refine the boundary region. To parallelly optimize multi-scale pixel-level features, we leverage the idea from Mixture-of-Expert (MoE) to design and train our MORSE with a stochastic gating mechanism. Our experiments demonstrate that MORSE can work well with different medical segmentation backbones, consistently achieving competitive performance improvements in both 2D and 3D supervised medical segmentation methods. We also theoretically analyze the superiority of MORSE.
RESUMO
For medical image segmentation, contrastive learning is the dominant practice to improve the quality of visual representations by contrasting semantically similar and dissimilar pairs of samples. This is enabled by the observation that without accessing ground truth labels, negative examples with truly dissimilar anatomical features, if sampled, can significantly improve the performance. In reality, however, these samples may come from similar anatomical regions and the models may struggle to distinguish the minority tail-class samples, making the tail classes more prone to misclassification, both of which typically lead to model collapse. In this paper, we propose ARCO, a semi-supervised contrastive learning (CL) framework with stratified group theory for medical image segmentation. In particular, we first propose building ARCO through the concept of variance-reduced estimation and show that certain variance-reduction techniques are particularly beneficial in pixel/voxel-level segmentation tasks with extremely limited labels. Furthermore, we theoretically prove these sampling techniques are universal in variance reduction. Finally, we experimentally validate our approaches on eight benchmarks, i.e., five 2D/3D medical and three semantic segmentation datasets, with different label settings, and our methods consistently outperform state-of-the-art semi-supervised methods. Additionally, we augment the CL frameworks with these sampling techniques and demonstrate significant gains over previous methods. We believe our work is an important step towards semi-supervised medical image segmentation by quantifying the limitation of current self-supervision objectives for accomplishing such challenging safety-critical tasks.
RESUMO
Medical data often exhibits long-tail distributions with heavy class imbalance, which naturally leads to difficulty in classifying the minority classes (i.e., boundary regions or rare objects). Recent work has significantly improved semi-supervised medical image segmentation in long-tailed scenarios by equipping them with unsupervised contrastive criteria. However, it remains unclear how well they will perform in the labeled portion of data where class distribution is also highly imbalanced. In this work, we present ACTION++, an improved contrastive learning framework with adaptive anatomical contrast for semi-supervised medical segmentation. Specifically, we propose an adaptive supervised contrastive loss, where we first compute the optimal locations of class centers uniformly distributed on the embedding space (i.e., off-line), and then perform online contrastive matching training by encouraging different class features to adaptively match these distinct and uniformly distributed class centers. Moreover, we argue that blindly adopting a constant temperature τ in the contrastive loss on long-tailed medical data is not optimal, and propose to use a dynamic τ via a simple cosine schedule to yield better separation between majority and minority classes. Empirically, we evaluate ACTION++ on ACDC and LA benchmarks and show that it achieves state-of-the-art across two semi-supervised settings. Theoretically, we analyze the performance of adaptive anatomical contrast and confirm its superiority in label efficiency.
RESUMO
Automated segmentation in medical image analysis is a challenging task that requires a large amount of manually labeled data. However, most existing learning-based approaches usually suffer from limited manually annotated medical data, which poses a major practical problem for accurate and robust medical image segmentation. In addition, most existing semi-supervised approaches are usually not robust compared with the supervised counterparts, and also lack explicit modeling of geometric structure and semantic information, both of which limit the segmentation accuracy. In this work, we present SimCVD, a simple contrastive distillation framework that significantly advances state-of-the-art voxel-wise representation learning. We first describe an unsupervised training strategy, which takes two views of an input volume and predicts their signed distance maps of object boundaries in a contrastive objective, with only two independent dropout as mask. This simple approach works surprisingly well, performing on the same level as previous fully supervised methods with much less labeled data. We hypothesize that dropout can be viewed as a minimal form of data augmentation and makes the network robust to representation collapse. Then, we propose to perform structural distillation by distilling pair-wise similarities. We evaluate SimCVD on two popular datasets: the Left Atrial Segmentation Challenge (LA) and the NIH pancreas CT dataset. The results on the LA dataset demonstrate that, in two types of labeled ratios (i.e., 20% and 10%), SimCVD achieves an average Dice score of 90.85% and 89.03% respectively, a 0.91% and 2.22% improvement compared to previous best results. Our method can be trained in an end-to-end fashion, showing the promise of utilizing SimCVD as a general framework for downstream tasks, such as medical image synthesis, enhancement, and registration.
Assuntos
Destilação , Processamento de Imagem Assistida por Computador , Processamento de Imagem Assistida por Computador/métodos , Aprendizado de Máquina Supervisionado , Tomografia Computadorizada por Raios X/métodosRESUMO
Training supervised video captioning model requires coupled video-caption pairs. However, for many targeted languages, sufficient paired data are not available. To this end, we introduce the unpaired video captioning task aiming to train models without coupled video-caption pairs in target language. To solve the task, a natural choice is to employ a two-step pipeline system: first utilizing video-to-pivot captioning model to generate captions in pivot language and then utilizing pivot-to-target translation model to translate the pivot captions to the target language. However, in such a pipeline system, 1) visual information cannot reach the translation model, generating visual irrelevant target captions; 2) the errors in the generated pivot captions will be propagated to the translation model, resulting in disfluent target captions. To address these problems, we propose the Unpaired Video Captioning with Visual Injection system (UVC-VI). UVC-VI first introduces the Visual Injection Module (VIM), which aligns source visual and target language domains to inject the source visual information into the target language domain. Meanwhile, VIM directly connects the encoder of the video-to-pivot model and the decoder of the pivot-to-target model, allowing end-to-end inference by completely skipping the generation of pivot captions. To enhance the cross-modality injection of the VIM, UVC-VI further introduces a pluggable video encoder, i.e., Multimodal Collaborative Encoder (MCE). The experiments show that UVC-VI outperforms pipeline systems and exceeds several supervised systems. Furthermore, equipping existing supervised systems with our MCE can achieve 4% and 7% relative margins on the CIDEr scores to current state-of-the-art models on the benchmark MSVD and MSR-VTT datasets, respectively.
RESUMO
Contrastive learning (CL) aims to learn useful representation without relying on expert annotations in the context of medical image segmentation. Existing approaches mainly contrast a single positive vector (i.e., an augmentation of the same image) against a set of negatives within the entire remainder of the batch by simply mapping all input features into the same constant vector. Despite the impressive empirical performance, those methods have the following shortcomings: (1) it remains a formidable challenge to prevent the collapsing problems to trivial solutions; and (2) we argue that not all voxels within the same image are equally positive since there exist the dissimilar anatomical structures with the same image. In this work, we present a novel Contrastive Voxel-wise Representation Learning (CVRL) method to effectively learn low-level and high-level features by capturing 3D spatial context and rich anatomical information along both the feature and the batch dimensions. Specifically, we first introduce a novel CL strategy to ensure feature diversity promotion among the 3D representation dimensions. We train the framework through bi-level contrastive optimization (i.e., low-level and high-level) on 3D images. Experiments on two benchmark datasets and different labeled settings demonstrate the superiority of our proposed framework. More importantly, we also prove that our method inherits the benefit of hardness-aware property from the standard CL approaches. Codes will be available soon.
RESUMO
Many medical datasets have recently been created for medical image segmentation tasks, and it is natural to question whether we can use them to sequentially train a single model that (1) performs better on all these datasets, and (2) generalizes well and transfers better to the unknown target site domain. Prior works have achieved this goal by jointly training one model on multi-site datasets, which achieve competitive performance on average but such methods rely on the assumption about the availability of all training data, thus limiting its effectiveness in practical deployment. In this paper, we propose a novel multi-site segmentation framework called incremental-transfer learning (ITL), which learns a model from multi-site datasets in an end-to-end sequential fashion. Specifically, "incremental" refers to training sequentially constructed datasets, and "transfer" is achieved by leveraging useful information from the linear combination of embedding features on each dataset. In addition, we introduce our ITL framework, where we train the network including a site-agnostic encoder with pretrained weights and at most two segmentation decoder heads. We also design a novel site-level incremental loss in order to generalize well on the target domain. Second, we show for the first time that leveraging our ITL training scheme is able to alleviate challenging catastrophic forgetting problems in incremental learning. We conduct experiments using five challenging benchmark datasets to validate the effectiveness of our incremental-transfer learning approach. Our approach makes minimal assumptions on computation resources and domain-specific expertise, and hence constitutes a strong starting point in multi-site medical image segmentation.
RESUMO
Transformers have made remarkable progress towards modeling long-range dependencies within the medical image analysis domain. However, current transformer-based models suffer from several disadvantages: (1) existing methods fail to capture the important features of the images due to the naive tokenization scheme; (2) the models suffer from information loss because they only consider single-scale feature representations; and (3) the segmentation label maps generated by the models are not accurate enough without considering rich semantic contexts and anatomical textures. In this work, we present CASTformer, a novel type of adversarial transformers, for 2D medical image segmentation. First, we take advantage of the pyramid structure to construct multi-scale representations and handle multi-scale variations. We then design a novel class-aware transformer module to better learn the discriminative regions of objects with semantic structures. Lastly, we utilize an adversarial training strategy that boosts segmentation accuracy and correspondingly allows a transformer-based discriminator to capture high-level semantically correlated contents and low-level anatomical features. Our experiments demonstrate that CASTformer dramatically outperforms previous state-of-the-art transformer-based approaches on three benchmarks, obtaining 2.54%-5.88% absolute improvements in Dice over previous models. Further qualitative experiments provide a more detailed picture of the model's inner workings, shed light on the challenges in improved transparency, and demonstrate that transfer learning can greatly improve performance and reduce the size of medical image datasets in training, making CASTformer a strong starting point for downstream medical image analysis tasks.