MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation
We propose a novel echocardiography video segmentation model that adapts SAM to medical videos to address long-standing challenges in ultrasound video segmentation: (1) massive speckle noise and artifacts, (2) extremely ambiguous boundaries, and (3) large variations of target objects across frames. The core technique of our model is a temporal-aware and noise-resilient prompting scheme. Specifically, we employ a space-time memory that contains both spatial and temporal information to prompt the segmentation of the current frame, and thus we call the proposed model MemSAM. During prompting, the memory, which carries temporal cues, sequentially prompts the video segmentation frame by frame. Meanwhile, because the memory prompt propagates high-level features, it avoids the misidentification caused by mask propagation and improves representation consistency. To address the challenge of speckle noise, we further propose a memory reinforcement mechanism, which leverages predicted masks to improve the quality of the memory before storing it. We extensively evaluate our method on two public datasets and demonstrate state-of-the-art performance compared to existing models. Notably, our model achieves performance comparable to fully supervised approaches while using limited annotations. Code is available at https://github.com/dengxl0520/MemSAM.
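The prompting loop described above can be illustrated with a minimal pure-Python sketch. It is not the released MemSAM code: the function names (`read_memory`, `reinforce`, `segment_video`, `toy_decoder`), the dot-product attention readout, the given first-frame mask, and the choice to reinforce features by scaling them with predicted foreground probabilities are all assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def read_memory(query, mem_keys, mem_values):
    # Assumed readout: scaled dot-product attention from each query
    # location over every location stored in the space-time memory.
    scale = math.sqrt(len(query[0]))
    dim = len(mem_values[0])
    readout = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in mem_keys]
        weights = softmax(scores)
        readout.append([sum(w * v[d] for w, v in zip(weights, mem_values))
                        for d in range(dim)])
    return readout

def reinforce(features, mask_probs):
    # Hypothetical reinforcement: down-weight features at locations the
    # predicted mask marks as background before they enter the memory.
    return [[p * x for x in f] for f, p in zip(features, mask_probs)]

def segment_video(frames, first_mask, predict_mask):
    # frames: per-frame features, each a list of locations (lists of floats).
    # first_mask: assumed to be given for frame 0.
    # predict_mask: hypothetical stand-in for a promptable mask decoder.
    mem_keys, mem_values, masks = [], [], []
    for t, feats in enumerate(frames):
        if t == 0:
            mask = first_mask
        else:
            prompt = read_memory(feats, mem_keys, mem_values)
            mask = predict_mask(feats, prompt)
        masks.append(mask)
        # Store raw features as keys and mask-reinforced features as values.
        mem_keys.extend(feats)
        mem_values.extend(reinforce(feats, mask))
    return masks

def toy_decoder(feats, prompt):
    # Toy decoder: sigmoid of the summed memory readout per location.
    return [1.0 / (1.0 + math.exp(-sum(p))) for p in prompt]

# Three frames, two locations each, two feature channels.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.8, 0.2], [0.2, 0.8]],
]
masks = segment_video(frames, first_mask=[1.0, 0.0], predict_mask=toy_decoder)
```

In this sketch the memory grows by one frame's worth of locations per step, so later frames attend over all earlier frames, which is what lets temporal cues carry forward. Passing feature-level readouts to the decoder, rather than warping the previous mask, mirrors the abstract's point that the prompt propagates high-level features instead of masks.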