
Vision-and-Language BERT (ViLBERT) is a BERT-based model for learning task-agnostic joint representations of image content and natural language. ViLBERT extends the popular BERT architecture to a multi-modal two-stream model: visual and textual inputs are processed in separate streams that interact through co-attentional transformer layers.

Source: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
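The co-attentional exchange described above can be sketched as follows: each stream computes attention using queries from its own modality but keys and values from the other modality. This is a minimal single-head NumPy sketch; the learned projection matrices, multi-head splitting, residual connections, and feed-forward sublayers of the actual model are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(visual, textual):
    """One simplified co-attentional exchange between two streams.

    Each stream uses its own queries but takes keys and values from
    the *other* stream, so visual features are conditioned on language
    and vice versa. Inputs are (seq_len, d_model) arrays; learned
    projections are omitted for brevity (an assumption of this sketch).
    """
    scale = np.sqrt(visual.shape[-1])
    # Visual stream: queries from image regions, keys/values from text tokens.
    vis_out = softmax(visual @ textual.T / scale) @ textual
    # Textual stream: queries from text tokens, keys/values from image regions.
    txt_out = softmax(textual @ visual.T / scale) @ visual
    return vis_out, txt_out

# Toy usage: 4 image regions and 6 text tokens, 8-dim features.
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
t = rng.normal(size=(6, 8))
v2, t2 = co_attention(v, t)
print(v2.shape, t2.shape)  # (4, 8) (6, 8)
```

Note that each output keeps the sequence length of its own stream while mixing in features from the other modality, which is what lets the two streams interact without being forced into a single joint sequence.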

Latest Papers

On the Role of Images for Analyzing Claims in Social Media
Gullal S. Cheema, Sherzod Hakimov, Eric Müller-Budack, Ralph Ewerth
2021-03-17

Seeing past words: Testing the cross-modal capabilities of pretrained V&L models
Letitia Parcalabescu, Albert Gatt, Anette Frank, Iacer Calixto
2020-12-22

A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
Linjie Li, Zhe Gan, Jingjing Liu
2020-12-15

A Multi-Modal Method for Satire Detection using Textual and Visual Cues
Lily Li, Or Levi, Pedram Hosseini, David A. Broniatowski
2020-10-13

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi
2020-09-23

Contrastive Visual-Linguistic Pretraining
Lei Shi, Kai Shuang, Shijie Geng, Peng Su, Zhengkai Jiang, Peng Gao, Zuohui Fu, Gerard de Melo, Sen Su
2020-07-26

What Does BERT with Vision Look At?
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang
2020-07-01

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu
2020-05-15

Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions
Arjun R. Akula, Spandana Gella, Yaser Al-Onaizan, Song-Chun Zhu, Siva Reddy
2020-05-04

Generating Rationales in Visual Question Answering
Hammad A. Ayyubi, Md. Mehrab Tanjim, Julian J. McAuley, Garrison W. Cottrell
2020-04-04

Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
Vishvak Murahari, Dhruv Batra, Devi Parikh, Abhishek Das
2019-12-05

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
2019-08-06
