Image Representations

VirTex, or Visual representations from Textual annotations, is a pretraining approach that uses semantically dense captions to learn visual representations. First, a ConvNet and a Transformer are jointly trained from scratch to generate natural-language captions for images; then the learned features are transferred to downstream visual recognition tasks.

Source: VirTex: Learning Visual Representations from Textual Annotations
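The pretraining setup described above can be sketched as a captioning model whose visual backbone is the artifact kept for transfer. This is a minimal illustrative sketch, not the paper's implementation: the tiny convolutional backbone (a stand-in for the ResNet-50 VirTex actually uses), the layer sizes, and all names such as `VirTexSketch` are assumptions.

```python
import torch
import torch.nn as nn


class VirTexSketch(nn.Module):
    """Hypothetical sketch of VirTex-style pretraining: a ConvNet encodes the
    image, and a Transformer decoder generates the caption conditioned on it."""

    def __init__(self, vocab_size=1000, d_model=128):
        super().__init__()
        # Visual backbone (stand-in for the paper's ResNet-50); its features
        # are what gets transferred to downstream recognition tasks.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        feats = self.backbone(images)              # (B, d_model, 7, 7)
        memory = feats.flatten(2).transpose(1, 2)  # (B, 49, d_model) visual "memory"
        tgt = self.embed(tokens)                   # (B, T, d_model)
        # Causal mask so each caption token attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)                      # (B, T, vocab_size) logits


model = VirTexSketch()
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 1000, (2, 12))
logits = model(images, tokens)          # train with cross-entropy on captions
```

After captioning pretraining, `model.backbone` (not the decoder) would be reused for downstream classification, detection, or segmentation.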

Tasks
Task Papers Share
Image Reconstruction 1 12.50%
Quantization 1 12.50%
General Classification 1 12.50%
Image Captioning 1 12.50%
Image Classification 1 12.50%
Instance Segmentation 1 12.50%
Object Detection 1 12.50%
Semantic Segmentation 1 12.50%
