Multi-Modal Methods

Vokenization is an approach for extrapolating multimodal alignments to language-only data by contextually mapping language tokens to related images ("vokens") via retrieval. Instead of directly supervising the language model with visually grounded language datasets (e.g., MS COCO), these relatively small datasets are used to train the vokenization processor (the "vokenizer"). Vokens are then generated for large language corpora (e.g., English Wikipedia), and the visually-supervised language model takes its supervision from these large datasets, bridging the gap between the different data sources.

Source: Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
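The retrieval step described above can be sketched as nearest-neighbor matching between contextual token embeddings and image embeddings. This is a minimal illustration only, assuming cosine similarity as the relevance score; the paper's actual vokenizer uses learned, jointly trained text and image encoders, which are replaced here by random toy embeddings.

```python
import numpy as np

def vokenize(token_embs: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Map each contextual token embedding to the index of its most
    relevant image ("voken") by maximum cosine similarity."""
    t = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sim = t @ v.T              # (num_tokens, num_images) relevance scores
    return sim.argmax(axis=1)  # one voken id per token

# Toy example: 4 contextual token embeddings, 3 candidate images, dim 8.
# (Random vectors stand in for the learned encoders of the real method.)
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
images = rng.normal(size=(3, 8))
vokens = vokenize(tokens, images)  # array of 4 image indices, one per token
```

In the full method, the voken ids produced this way over a large corpus serve as an auxiliary classification target alongside the usual masked-language-modelling objective.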

Papers



Tasks


Task                       | Papers | Share
Language Modelling         | 3      | 25.00%
Visual Grounding           | 2      | 16.67%
Grounded language learning | 1      | 8.33%
Language Acquisition       | 1      | 8.33%
Sentence                   | 1      | 8.33%
Image Retrieval            | 1      | 8.33%
Retrieval                  | 1      | 8.33%
Video Grounding            | 1      | 8.33%
Image Captioning           | 1      | 8.33%

Components


