Visual Dialog requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the task is to answer the question.
|TREND||DATASET||BEST METHOD||PAPER TITLE||PAPER||CODE||COMPARE|
Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems.
Ranked #19 on Visual Question Answering on VQA v2 test-dev
In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolution neural networks (CNN).
Ranked #3 on Visual Question Answering on VQA v1 test-std
Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images.
In contrast, discriminative dialog models (D) that are trained to rank a list of candidate human responses outperform their generative counterparts; in terms of automatic metrics, diversity, and informativeness of the responses.
Ranked #8 on Visual Dialog on VisDial v0.9 val
Next, we find that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG -- more than 10% over our base model -- but hurts MRR -- more than 17% below our base model!
More importantly, we can tell which modality (visual or semantic) has more contribution in answering the current question by visualizing the gate values.
Ranked #6 on Visual Dialog on VisDial v0.9 val
Visual dialog entails answering a series of questions grounded in an image, using dialog history as context.
Ranked #1 on Common Sense Reasoning on Visual Dialog v0.9
Experiments on both simulated and real-world data show that 1) our proposed learning framework achieves better accuracy than other supervised and reinforcement learning baselines and 2) user feedback based on natural language rather than pre-specified attributes leads to more effective retrieval results, and a more natural and expressive communication interface.