ViLT is a minimal vision-and-language pre-training transformer in which the processing of visual inputs is drastically simplified to the same convolution-free manner used for text inputs: image patches are linearly embedded rather than passed through a CNN backbone or region detector. As a result, the modality-specific components of ViLT require far less computation than the transformer component that handles multimodal interactions. The model is pre-trained on three objectives: image text matching, masked language modeling, and word patch alignment.
Source: ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
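Below is a minimal inference sketch using the Hugging Face `transformers` port of ViLT, assuming the `dandelin/vilt-b32-finetuned-vqa` checkpoint (a variant fine-tuned for VQA, not the raw pre-trained model); the class and checkpoint names come from the `transformers` library rather than from the paper itself.

```python
from PIL import Image
import requests
from transformers import ViltProcessor, ViltForQuestionAnswering

# Example image (COCO val2017) and a question about it.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

# The processor pairs a text tokenizer with an image processor;
# the model itself embeds image patches with a linear projection,
# with no CNN backbone or region detector involved.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)

# The VQA head classifies over a fixed answer vocabulary.
predicted_idx = outputs.logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[predicted_idx])
```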
| Task | Papers | Share |
|---|---|---|
| Visual Question Answering (VQA) | 3 | 14.29% |
| Visual Reasoning | 3 | 14.29% |
| Cross-Modal Retrieval | 2 | 9.52% |
| Image Retrieval | 2 | 9.52% |
| Zero-Shot Cross-Modal Retrieval | 2 | 9.52% |
| Question Answering | 1 | 4.76% |
| Visual Question Answering | 1 | 4.76% |
| Image Captioning | 1 | 4.76% |
| Natural Language Understanding | 1 | 4.76% |