VisDial (Visual Dialog)

Introduced by Das et al. in Visual Dialog

The Visual Dialog (VisDial) dataset contains human-annotated questions about images from the MS COCO dataset. It was collected by pairing two workers on Amazon Mechanical Turk to chat about an image. One worker acted as the ‘questioner’ and the other as the ‘answerer’. The questioner sees only a text description of the image (an image caption from MS COCO); the image itself stays hidden. The questioner's task is to ask questions about the hidden image to “imagine the scene better”. The answerer sees both the image and the caption and answers the questioner's questions. The conversation continues for at most 10 rounds of questions and answers.
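The collection setup described above can be pictured as a small data structure. The following is a minimal sketch, not the official VisDial file format: the class names, field names (such as `image_id`), and the `questioner_view` helper are hypothetical and chosen only for illustration. It encodes two points from the description: a dialog has at most 10 question-answer rounds, and the questioner works from the caption and the dialog history, never from the image.

```python
from dataclasses import dataclass, field
from typing import List

# The description caps a conversation at 10 question-answer rounds.
MAX_ROUNDS = 10


@dataclass
class Round:
    question: str  # asked by the questioner, who cannot see the image
    answer: str    # given by the answerer, who sees image and caption


@dataclass
class Dialog:
    image_id: int  # hypothetical identifier for the MS COCO image
    caption: str   # MS COCO caption, shown to both participants
    rounds: List[Round] = field(default_factory=list)

    def add_round(self, question: str, answer: str) -> None:
        """Append one question-answer round, enforcing the 10-round cap."""
        if len(self.rounds) >= MAX_ROUNDS:
            raise ValueError(f"a dialog holds at most {MAX_ROUNDS} rounds")
        self.rounds.append(Round(question, answer))

    def questioner_view(self) -> List[str]:
        """Return the text the questioner can see: caption plus history."""
        view = [self.caption]
        for r in self.rounds:
            view.append(f"Q: {r.question}")
            view.append(f"A: {r.answer}")
        return view


if __name__ == "__main__":
    dialog = Dialog(image_id=1, caption="a cat sitting on a sofa")
    dialog.add_round("What color is the cat?", "Black")
    for line in dialog.questioner_view():
        print(line)
```

Keeping the image out of `questioner_view` mirrors the asymmetry of the task: only the answerer's side would ever need access to the image itself.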

VisDial v1.0 contains 123K dialogues on MS COCO images (2017 training set) for the training split, 2K dialogues on validation images for the validation split, and 8K dialogues on test images for the test-standard split. The previously released v0.5 and v0.9 versions of VisDial (corresponding to older splits of MS COCO) are considered deprecated.

Source: Granular Multimodal Attention Networks for Visual Dialog

Benchmarks

Task	Dataset Variant	Best Model
Visual Dialog	Visual Dialog v1.0 test-std	Single
Visual Dialog	VisDial v0.9 val	9xFGA
Visual Dialog	VisDial v1.0 test-std	5xFGA + LS*+
Chat-based Image Retrieval	VisDial	ChatGPT & BLIP2
Common Sense Reasoning	Visual Dialog v0.9	PDUN
Common Sense Reasoning	Visual Dialog v0.9	NMN [kottur2018visual]