Visual Dialog
54 papers with code • 8 benchmarks • 10 datasets
Visual Dialog requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the task is to answer the question.
Libraries
Use these libraries to find Visual Dialog models and implementationsDatasets
Most implemented papers
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i. e., high-resolution visual tokens, high-quality data, and VLM-guided generation.
Learning to Reason: End-to-End Module Networks for Visual Question Answering
Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems.
Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model
In contrast, discriminative dialog models (D) that are trained to rank a list of candidate human responses outperform their generative counterparts; in terms of automatic metrics, diversity, and informativeness of the responses.
Examining Cooperation in Visual Dialog Models
In this work we propose a blackbox intervention method for visual dialog models, with the aim of assessing the contribution of individual linguistic or visual components.
Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog
Goal-oriented dialogue tasks occur when a questioner asks an action-oriented question and an answerer responds with the intent of letting the questioner know a correct action to take.
Dialog-based Interactive Image Retrieval
Experiments on both simulated and real-world data show that 1) our proposed learning framework achieves better accuracy than other supervised and reinforcement learning baselines and 2) user feedback based on natural language rather than pre-specified attributes leads to more effective retrieval results, and a more natural and expressive communication interface.
Ask No More: Deciding when to guess in referential visual dialogue
We make initial steps towards this general goal by augmenting a task-oriented visual dialogue model with a decision-making component that decides whether to ask a follow-up question to identify a target referent in an image, or to stop the conversation to make a guess.
Learning Goal-Oriented Visual Dialog via Tempered Policy Gradient
Learning goal-oriented dialogues by means of deep reinforcement learning has recently become a popular research topic.
Visual Reasoning with Multi-hop Feature Modulation
Recent breakthroughs in computer vision and natural language processing have spurred interest in challenging multi-modal tasks such as visual question-answering and visual dialogue.
Visual Coreference Resolution in Visual Dialog using Neural Module Networks
Visual dialog entails answering a series of questions grounded in an image, using dialog history as context.