Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge...
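ALIGN, the model behind the results below, pairs an image encoder with a text encoder and trains both on noisy image/alt-text pairs using a symmetric normalized-softmax contrastive loss. A minimal NumPy sketch of that kind of objective on pre-computed embeddings follows; the function name and temperature value are illustrative, not the paper's exact implementation:

```python
import numpy as np

def contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                     temperature: float = 0.05) -> float:
    """Symmetric InfoNCE-style loss over a batch of matched
    (image, text) embedding pairs; row i of each matrix is a pair."""
    # L2-normalize so the dot product becomes cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (batch, batch) similarity scores
    labels = np.arange(len(logits))          # matched pairs sit on the diagonal

    def xent(l: np.ndarray) -> float:
        # Softmax cross-entropy with the diagonal entries as targets.
        l = l - l.max(axis=1, keepdims=True)             # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned embeddings (e.g. identical one-hot rows) the loss approaches zero; shuffling the text rows against the image rows drives it up, which is what pushes matched pairs together and mismatched pairs apart during training.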

No code implementations yet.

Results from the Paper


 Ranked #1 on Image Classification on VTAB-1k (using extra training data)

TASK                                DATASET           MODEL                     METRIC               VALUE    GLOBAL RANK
Cross-Modal Retrieval               COCO 2014         ALIGN                     Image-to-text R@1    77       # 1
                                                                                Image-to-text R@5    93.5     # 1
                                                                                Image-to-text R@10   96.9     # 1
                                                                                Text-to-image R@1    59.9     # 1
                                                                                Text-to-image R@5    83.3     # 1
                                                                                Text-to-image R@10   89.8     # 1
Cross-Modal Retrieval               Flickr30k         ALIGN                     Image-to-text R@1    95.3     # 1
                                                                                Image-to-text R@5    99.8     # 1
                                                                                Image-to-text R@10   100      # 1
                                                                                Text-to-image R@1    84.9     # 1
                                                                                Text-to-image R@5    97.4     # 1
                                                                                Text-to-image R@10   98.6     # 1
Image Classification                Flowers-102       ALIGN                     Accuracy             99.65%   # 5
Fine-Grained Image Classification   Food-101          ALIGN                     Accuracy             95.88    # 2
Image Classification                ImageNet          ALIGN (EfficientNet-L2)   Top 1 Accuracy       88.64%   # 4
                                                                                Top 5 Accuracy       98.67%   # 5
                                                                                Number of params     480M     # 6
Fine-Grained Image Classification   Oxford-IIIT Pets  ALIGN                     Accuracy             96.19%   # 5
Fine-Grained Image Classification   Stanford Cars     ALIGN                     Accuracy             96.13%   # 3
Image Classification                VTAB-1k           ALIGN (50 hypers/task)    Top-1 Accuracy       79.99    # 1
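The retrieval numbers in the table are Recall@K: the fraction of queries whose ground-truth match appears among the top K retrieved items. A minimal sketch of how R@K can be computed from a query-gallery similarity matrix (all names here are illustrative, not from the paper):

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose true match appears in the top-k results.

    Assumes query i's ground-truth match is gallery item i, i.e. the
    correct scores lie on the diagonal of the similarity matrix.
    """
    # Rank gallery items for each query by descending similarity.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = (top_k == np.arange(similarity.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Toy 3x3 example: rows are image queries, columns are captions.
sim = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.8, 0.4],
                [0.2, 0.6, 0.5]])
print(recall_at_k(sim, 1))    # query 2's best match is caption 1, so R@1 = 2/3
print(recall_at_k(sim.T, 1))  # transposing swaps the retrieval direction
```

Image-to-text and text-to-image retrieval use the same matrix transposed, which is why the table reports the two directions separately.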

Methods used in the Paper


No methods listed.