ViLBERT - Visual Question Answering

Last updated on Mar 15, 2021

Parameters: 245 million
File size: 863.88 MB

Training techniques: AdamW
Architecture: Dropout, Layer Normalization, Linear Layer, Residual Network, ResNet
Learning rate: 4e-05
Epochs: 40


ViLBERT (short for Vision-and-Language BERT) is a model for learning task-agnostic joint representations of image content and natural language.

Explore the live Visual Question Answering demo at AllenNLP.

How do I load this model?

from allennlp_models.pretrained import load_predictor
predictor = load_predictor("vqa-vilbert")

Getting predictions

image_path = ""  # path or URL of the image to ask about
question = "What game are they playing?"
preds = predictor.predict(image_path, question)
# "probs" and "tokens" are parallel lists; pick the highest-probability answer.
best_prob, best_answer = max(zip(preds["probs"], preds["tokens"]), key=lambda x: x[0])
print(f"p({best_answer}) = {best_prob:.2%}")
# prints: p(baseball) = 100.00%
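The ranking step above can also return the top few answers rather than just the best one. A minimal sketch, using a toy `preds` dict with illustrative probabilities (not actual model output):

```python
# Toy stand-in for the predictor output: parallel "probs" and "tokens" lists.
preds = {"probs": [0.97, 0.02, 0.01], "tokens": ["baseball", "frisbee", "soccer"]}

# Sort (probability, answer) pairs descending and keep the top three.
top = sorted(zip(preds["probs"], preds["tokens"]), reverse=True)[:3]
for prob, answer in top:
    print(f"p({answer}) = {prob:.2%}")
```

The same `sorted(zip(...))` pattern works on the real predictor output, since it returns the full answer vocabulary with a probability per token.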

You can also get predictions using the allennlp command-line interface:

echo '{"question": "What game are they playing?", "image": ""}' | \
    allennlp predict -
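The command above reads one JSON object per line from standard input. For batch prediction you can write those lines to a file instead; a sketch (the filename `vqa_inputs.jsonl` and the second question are illustrative):

```python
import json

# Each input line for `allennlp predict` is a JSON object with
# "question" and "image" keys, one example per line (JSON Lines).
examples = [
    {"question": "What game are they playing?", "image": ""},
    {"question": "How many players are visible?", "image": ""},
]
with open("vqa_inputs.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

You would then pass `vqa_inputs.jsonl` to `allennlp predict` in place of `-`.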

How do I evaluate this model?

To evaluate the model on the VQA dataset, run:

allennlp evaluate \

How do I train this model?

To train this model, use the allennlp CLI tool with the configuration file vilbert_vqa_pretrained.jsonnet:

allennlp train vilbert_vqa_pretrained.jsonnet -s output_dir

See the AllenNLP Training and prediction guide for more details.


@inproceedings{lu2019vilbert,
 author = {Jiasen Lu and Dhruv Batra and D. Parikh and Stefan Lee},
 booktitle = {NeurIPS},
 title = {ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks},
 year = {2019}
}