| Parameter | Value |
| --- | --- |
| Training Techniques | AdamW |
| Architecture | Dropout, Layer Normalization, Linear Layer, Residual Network, ResNet |
| LR | 0.00004 |
ViLBERT (short for Vision-and-Language BERT) is a model for learning task-agnostic joint representations of image content and natural language.
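At the heart of ViLBERT are co-attentional transformer layers in which each modality's queries attend to the other modality's keys and values. The sketch below illustrates that exchange with PyTorch's built-in multi-head attention; the dimensions, class name, and use of `nn.MultiheadAttention` are illustrative assumptions, not the model's actual implementation.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One co-attentional exchange: text attends to image, image attends to text."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):
        # Queries come from one stream; keys and values from the other.
        txt_out, _ = self.txt_attends_img(txt, img, img)
        img_out, _ = self.img_attends_txt(img, txt, txt)
        # Residual connections keep each stream's own representation.
        return txt + txt_out, img + img_out

txt = torch.randn(1, 12, 768)   # e.g. 12 word-piece tokens
img = torch.randn(1, 36, 768)   # e.g. 36 image-region features
new_txt, new_img = CoAttentionBlock()(txt, img)
print(new_txt.shape, new_img.shape)
```

Each stream keeps its own sequence length; only the attention context crosses modalities.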
Explore the live Visual Question Answering demo at AllenNLP.
from allennlp_models.pretrained import load_predictor

# Load the pretrained ViLBERT VQA predictor (downloads the model archive on first use).
predictor = load_predictor("vqa-vilbert")

image_path = "https://storage.googleapis.com/allennlp-public-data/vqav2/baseball.jpg"
question = "What game are they playing?"
preds = predictor.predict(image_path, question)

# "probs" and "tokens" are parallel lists; pick the highest-probability answer.
best_prob, best_answer = max(zip(preds["probs"], preds["tokens"]), key=lambda x: x[0])
print(f"p({best_answer}) = {best_prob:.2%}")
# prints: p(baseball) = 100.00%
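Since `probs` and `tokens` are parallel lists, you can also rank all candidate answers rather than keeping only the best one. A small helper, assuming the same output structure as above (the `mock` dict here is fabricated illustration data, not real model output):

```python
def top_k_answers(preds, k=3):
    """Return the k most probable (answer, probability) pairs."""
    pairs = sorted(zip(preds["tokens"], preds["probs"]),
                   key=lambda x: x[1], reverse=True)
    return pairs[:k]

# Example with a mock dict shaped like the predictor's output:
mock = {"tokens": ["baseball", "soccer", "tennis"], "probs": [0.98, 0.015, 0.005]}
print(top_k_answers(mock, k=2))  # [('baseball', 0.98), ('soccer', 0.015)]
```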
You can also get predictions using the allennlp command-line interface:
echo '{"question": "What game are they playing?", "image": "https://storage.googleapis.com/allennlp-public-data/vqav2/baseball.jpg"}' | \
allennlp predict https://storage.googleapis.com/allennlp-public-models/vilbert-vqa-pretrained.2021-02-11.tar.gz -
To evaluate the model on the VQA dataset, run:
allennlp evaluate https://storage.googleapis.com/allennlp-public-models/vilbert-vqa-pretrained.2021-02-11.tar.gz \
balanced_real_val
To train this model, use the allennlp CLI tool with the configuration file vilbert_vqa_pretrained.jsonnet:
allennlp train vilbert_vqa_pretrained.jsonnet -s output_dir
See the AllenNLP Training and prediction guide for more details.
@inproceedings{Lu2019ViLBERTPT,
  author    = {Jiasen Lu and Dhruv Batra and Devi Parikh and Stefan Lee},
  booktitle = {NeurIPS},
  title     = {ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks},
  year      = {2019}
}