We build a new evaluation set by adding spotting words to the images of the ImageNet 2012 evaluation set. There are 1,000 categories in ImageNet. For each category c, we find its most confusing category c* and spot the name of c* onto every evaluation image of c.
This evaluation set is challenging for many CLIP models. For example, OpenAI CLIP B-16 reaches a top-1 accuracy of only 32%, far below its accuracy on the original ImageNet evaluation set.
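
The description does not say how the most confusing category c* is chosen. As a rough sketch, assuming it is read off a classifier's confusion matrix (an assumption made here for illustration, not a detail confirmed by the source), c* could be the wrong class predicted most often for images of c. The function name `most_confusing`, the toy class names, and the counts below are all hypothetical.

```python
from typing import List, Sequence


def most_confusing(confusion: Sequence[Sequence[int]]) -> List[int]:
    """For each true class c, return the wrong class predicted most often.

    confusion[c][j] counts images of true class c predicted as class j.
    Assumption: "most confusing" means the largest off-diagonal count.
    """
    result = []
    for c, row in enumerate(confusion):
        # Skip the diagonal entry, which counts correct predictions.
        candidates = [(count, -j) for j, count in enumerate(row) if j != c]
        # Negated indices make ties resolve toward the lower class index.
        _, neg_j = max(candidates)
        result.append(-neg_j)
    return result


# Toy 3-class example: rows are true classes, columns are predictions.
names = ["tabby cat", "tiger cat", "golden retriever"]
confusion = [
    [80, 18, 2],
    [25, 70, 5],
    [3, 1, 96],
]

# Map each category name to the word that would be spotted onto its images.
spot_word = {names[c]: names[j] for c, j in enumerate(most_confusing(confusion))}
```

In this toy example, images of "tabby cat" would receive the spotted word "tiger cat" and vice versa. Rendering the word onto the image is a separate step that this sketch does not cover.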