Visual Dialog requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the task is to answer the question.
Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems.
Ranked #19 on Visual Question Answering on VQA v2 test-dev
We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content.
Ranked #15 on Visual Dialog on VisDial v0.9 val
In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).
Ranked #3 on Visual Question Answering on VQA v1 test-std
Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images.
In contrast, discriminative dialog models (D) that are trained to rank a list of candidate human responses outperform their generative counterparts in terms of automatic metrics, diversity, and informativeness of the responses.
Ranked #8 on Visual Dialog on VisDial v0.9 val
Next, we find that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG -- more than 10% over our base model -- but hurts MRR -- more than 17% below our base model!
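To make the NDCG/MRR trade-off concrete, here is a minimal Python sketch of the two metrics under their standard textbook definitions. The function names, toy scores, and relevance values are illustrative assumptions and are not taken from the VisDial evaluation code; the official evaluation may apply a specific cutoff `k`.

```python
import math

def rank_of(scores, index):
    """1-based rank of candidate `index` when sorting by descending score."""
    return 1 + sum(1 for s in scores if s > scores[index])

def mean_reciprocal_rank(gt_ranks):
    """MRR over questions, given the 1-based rank of each ground-truth answer."""
    return sum(1.0 / r for r in gt_ranks) / len(gt_ranks)

def dcg(relevances):
    """Discounted cumulative gain of a relevance list already in ranked order."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(scores, relevances, k=None):
    """NDCG of the ranking induced by `scores`, optionally truncated at k."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal = sorted(relevances, reverse=True)
    if k is not None:
        ranked, ideal = ranked[:k], ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example (made-up numbers): four candidate answers, ground truth at index 0.
# Dense relevance marks candidates 0, 1 and 2 as equally acceptable.
relevances = [1.0, 1.0, 1.0, 0.0]
scores_a = [0.9, 0.2, 0.1, 0.5]  # puts the ground truth first
scores_b = [0.5, 0.9, 0.8, 0.1]  # puts all acceptable answers on top, ground truth third

mrr_a = mean_reciprocal_rank([rank_of(scores_a, 0)])
mrr_b = mean_reciprocal_rank([rank_of(scores_b, 0)])
ndcg_a = ndcg(scores_a, relevances)
ndcg_b = ndcg(scores_b, relevances)
```

In this toy case `scores_b` obtains the higher NDCG while `scores_a` obtains the higher MRR, which illustrates how tuning toward dense relevance annotations can raise one metric while lowering the other.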
More importantly, we can tell which modality (visual or semantic) has more contribution in answering the current question by visualizing the gate values.
Ranked #6 on Visual Dialog on VisDial v0.9 val
Feature Selection, Question Answering, Visual Dialog, Visual Question Answering
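The idea of reading modality contributions off gate values, mentioned in the entry above, can be sketched with a generic gated fusion of a visual and a semantic feature vector. This is an assumed, simplified formulation (a sigmoid gate over the concatenated features), not necessarily the paper's architecture; `gated_fusion`, `weights`, and `bias` are hypothetical names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(visual, semantic, weights, bias):
    """Fuse two same-length feature vectors with a per-dimension gate.

    weights[d] is a hypothetical weight vector over the concatenation
    [visual; semantic] for output dimension d, and bias[d] is its bias.
    Returns (fused, gate); gate[d] near 1 favours the visual feature,
    gate[d] near 0 favours the semantic feature.
    """
    concat = list(visual) + list(semantic)
    gate = [
        sigmoid(sum(w * x for w, x in zip(weights[d], concat)) + bias[d])
        for d in range(len(visual))
    ]
    fused = [g * v + (1.0 - g) * s for g, v, s in zip(gate, visual, semantic)]
    return fused, gate

# Toy inputs (made-up numbers) with an untrained, all-zero gate.
visual = [1.0, 2.0]
semantic = [3.0, 4.0]
zero_weights = [[0.0] * 4, [0.0] * 4]
fused_even, gate_even = gated_fusion(visual, semantic, zero_weights, [0.0, 0.0])

# A large positive bias pushes the gate toward the visual modality.
fused_vis, gate_vis = gated_fusion(visual, semantic, zero_weights, [10.0, 10.0])
```

Inspecting `gate` per question, for example by averaging it over dimensions, gives a rough indication of which modality dominated the fused representation.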
Visual dialog is a challenging vision-language task, which requires the agent to answer multi-round questions about an image.
Ranked #13 on Visual Dialog on VisDial v0.9 val
Visual dialog entails answering a series of questions grounded in an image, using dialog history as context.
Ranked #1 on Common Sense Reasoning on Visual Dialog v0.9
Common Sense Reasoning, Coreference Resolution, Visual Dialog, Visual Grounding, Visual Question Answering
Experiments on both simulated and real-world data show that 1) our proposed learning framework achieves better accuracy than other supervised and reinforcement learning baselines and 2) user feedback based on natural language rather than pre-specified attributes leads to more effective retrieval results, and a more natural and expressive communication interface.