Methods > Natural Language Processing > Transformers

VideoBERT adapts the powerful BERT model to learn a joint visual-linguistic representation for video. It is used in numerous tasks, including action classification and video captioning.

Source: VideoBERT: A Joint Model for Video and Language Representation Learning

Latest Papers

PAPER DATE
Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training
Chenyi LeiShixian LuoYong liuWanggui HeJiamang WangGuoxin WangHaihong TangChunyan MiaoHouqiang Li
2021-04-19
Visual Grounding Strategies for Text-Only Natural Language Processing
Damien Sileo
2021-03-25
VideoBERT: A Joint Model for Video and Language Representation Learning
| Chen SunAustin MyersCarl VondrickKevin MurphyCordelia Schmid
2019-04-03

Components

COMPONENT TYPE
🤖 No Components Found You can add them if they exist; e.g. Mask R-CNN uses RoIAlign

Categories