Multi-Modal Methods

Vokenization is an approach for extrapolating multimodal alignments to language-only data by contextually mapping language tokens to related images ("vokens") via retrieval. Instead of directly supervising the language model with visually grounded language datasets (e.g., MS COCO), these relatively small datasets are used to train the vokenization processor (the "vokenizer"). Vokens are then generated for large language corpora (e.g., English Wikipedia), and the visually-supervised language model takes its supervision from these large datasets, bridging the gap between the different data sources.

Source: Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
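The retrieval step described above can be sketched as nearest-neighbor matching between contextual token embeddings and image embeddings. This is a minimal illustration only, assuming cosine similarity as the relevance score; the paper's actual vokenizer uses learned, jointly trained text and image encoders, which are replaced here by random toy embeddings.

```python
import numpy as np

def vokenize(token_embs: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Map each contextual token embedding to the index of its most
    relevant image ("voken") by maximum cosine similarity."""
    t = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sim = t @ v.T              # (num_tokens, num_images) relevance scores
    return sim.argmax(axis=1)  # one voken id per token

# Toy example: 4 contextual token embeddings, 3 candidate images, dim 8.
# (Random vectors stand in for the learned encoders of the real method.)
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
images = rng.normal(size=(3, 8))
vokens = vokenize(tokens, images)  # array of 4 image indices, one per token
```

In the full method, the voken ids produced this way over a large corpus serve as an auxiliary classification target alongside the usual masked-language-modelling objective.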

Papers



Tasks


Task                       | Papers | Share
Language Modelling         | 3      | 25.00%
Visual Grounding           | 2      | 16.67%
Grounded language learning | 1      | 8.33%
Language Acquisition       | 1      | 8.33%
Sentence                   | 1      | 8.33%
Image Retrieval            | 1      | 8.33%
Retrieval                  | 1      | 8.33%
Video Grounding            | 1      | 8.33%
Image Captioning           | 1      | 8.33%

Components


