ImageNet-Atr (ImageNet with Adversarial Text Regions)

Introduced by Cao et al. in Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness

We build a new evaluation set by rendering text onto the images of the ImageNet 2012 validation set. ImageNet has 1,000 categories. For each category c, we find its most confusing category c* and render the name of c* onto every evaluation image of category c.
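
The construction can be sketched roughly as follows, assuming the validation images are stored per category on disk and a precomputed mapping from each category to its most confusing category is available; the file layout, font, and text placement below are illustrative assumptions rather than the authors' exact recipe.

    # Sketch of the construction described above. The `most_confusing` mapping,
    # file layout, font, and text placement are assumptions for illustration only.
    from pathlib import Path
    from PIL import Image, ImageDraw, ImageFont

    def spot_text(image_path: Path, text: str, out_path: Path) -> None:
        """Render a category name onto one evaluation image."""
        img = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(img)
        font = ImageFont.load_default()                       # assumption: any legible font
        draw.text((10, 10), text, fill="white", font=font)    # assumption: fixed top-left spot
        img.save(out_path)

    def build_imagenet_atr(val_images: dict, most_confusing: dict, out_dir: Path) -> None:
        """val_images: category c -> list of its validation image paths;
        most_confusing: category c -> its most confusing category c*."""
        for c, paths in val_images.items():
            for p in paths:
                out = out_dir / c / p.name
                out.parent.mkdir(parents=True, exist_ok=True)
                spot_text(p, most_confusing[c], out)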

This evaluation set is challenging for many CLIP models. For example, OpenAI CLIP ViT-B/16 achieves a top-1 accuracy of only 32%, far below its accuracy on the original ImageNet validation set.
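
For reference, a zero-shot evaluation on such a set can be sketched with OpenAI's CLIP package; the data path, folder layout (class-named folders with human-readable names), and prompt template below are assumptions for illustration, not the paper's exact protocol.

    # Zero-shot top-1 evaluation sketch using github.com/openai/CLIP.
    # Assumptions: images live under imagenet_atr/val/<class name>/ with
    # human-readable class names, and a simple "a photo of a {c}" prompt is used.
    import clip
    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/16", device=device)

    dataset = datasets.ImageFolder("imagenet_atr/val", transform=preprocess)
    loader = DataLoader(dataset, batch_size=256, num_workers=8)

    prompts = clip.tokenize([f"a photo of a {c}" for c in dataset.classes]).to(device)
    with torch.no_grad():
        text_features = model.encode_text(prompts)
        text_features /= text_features.norm(dim=-1, keepdim=True)

        correct = total = 0
        for images, labels in loader:
            image_features = model.encode_image(images.to(device))
            image_features /= image_features.norm(dim=-1, keepdim=True)
            preds = (image_features @ text_features.T).argmax(dim=-1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()

    print(f"Top-1 accuracy: {correct / total:.2%}")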

License


  • Unknown
