Should We Replace CNNs with Transformers for Medical Images?

29 Sep 2021  ·  Christos Matsoukas, Johan Fredin Haslum, Moein Sorkhei, Magnus Söderberg, Kevin Smith

Convolutional Neural Networks (CNNs) have reigned for a decade as the de facto approach to automated medical image diagnosis, pushing the state of the art in classification, detection, and segmentation tasks. Recently, vision transformers (ViTs) have emerged as a competitive alternative to CNNs, yielding impressive performance in the natural image domain while possessing several interesting properties that could prove beneficial for medical imaging tasks. In this work, we explore whether it is feasible to switch to transformer-based models in the medical imaging domain as well, or whether we should keep working with CNNs: can we trivially replace CNNs with transformers? We consider this question in a series of experiments on several standard medical image benchmark datasets and tasks. Our findings show that, while CNNs perform better when trained from scratch, off-the-shelf vision transformers are on par with CNNs when pretrained on ImageNet, in both classification and segmentation tasks. Further, ViTs often outperform their CNN counterparts when pretrained using self-supervision.
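The key architectural difference the abstract alludes to is that a ViT processes an image as a sequence of flattened patches (tokens) rather than with sliding convolutions. A minimal sketch of that patch-tokenization step in pure Python (no framework dependencies; a real ViT would then linearly embed each patch and add position embeddings before the transformer layers):

```python
def image_to_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    This mirrors the tokenization step of a Vision Transformer (ViT):
    each patch becomes one input token for the transformer encoder.
    """
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                patch.extend(image[row][left:left + patch_size])
            patches.append(patch)
    return patches

# A toy 4x4 "image" split into 2x2 patches yields 4 tokens of length 4.
img = [[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]]
tokens = image_to_patches(img, 2)
# tokens == [[1, 2, 5, 6], [3, 4, 7, 8], [9, 10, 13, 14], [11, 12, 15, 16]]
```

In a standard ViT-B/16, for instance, a 224×224 image with 16×16 patches produces 196 such tokens; unlike a CNN, no spatial locality is built in, which is one reason ViTs trained from scratch on small medical datasets tend to lag CNNs, while pretraining closes the gap.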


