Abstract
Denoising diffusion probabilistic models (DPMs) have recently shown significant success in medical image generation and denoising, while also serving as powerful representation learners for downstream tasks such as segmentation. However, their effectiveness in segmentation is limited by the need for detailed pixel-wise annotations, which are expensive, time-consuming, and require expert knowledge: a significant bottleneck in real-world clinical applications. To mitigate this label-efficiency limitation, we propose FastTextDiff, a fast and efficient diffusion-based segmentation model that integrates medical text annotations to enhance semantic representations.
Our approach leverages ModernBERT, a transformer-based language model capable of processing long medical text sequences, to establish a strong connection between textual annotations and semantic content in medical images. Trained on both MIMIC-III and MIMIC-IV, ModernBERT efficiently encodes the clinical knowledge needed to guide segmentation. Cross-modal attention mechanisms enable seamless interaction between the visual and textual modalities, yielding label-efficient segmentation with improved performance.
By replacing Clinical BioBERT with ModernBERT, our model benefits from Flash Attention 2 for memory-efficient training, an alternating attention mechanism for computational efficiency, and a pretraining corpus of 2 trillion tokens, significantly improving text-guided medical image segmentation. Our experiments show that FastTextDiff achieves better segmentation performance and faster training than conventional diffusion-based models. This study validates ModernBERT as a fast and scalable substitute for Clinical BioBERT in diffusion-based segmentation pipelines and demonstrates the promise of multi-modal techniques for medical image analysis. We release the trained model on Hugging Face and GitHub.