Visual Dialog

54 papers with code • 8 benchmarks • 10 datasets

Visual Dialog requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the task is to answer the question.
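The task's input/output structure can be sketched as follows. This is a minimal illustration of the data format only; the `answer` function is a hypothetical placeholder, not an actual model:

```python
from dataclasses import dataclass

@dataclass
class VisualDialogTurn:
    """One question-answer exchange in the dialog history."""
    question: str
    answer: str

@dataclass
class VisualDialogExample:
    """A single Visual Dialog instance: an image, the dialog so far,
    and a follow-up question the agent must answer."""
    image_path: str
    history: list  # list[VisualDialogTurn]
    question: str

def answer(example: VisualDialogExample) -> str:
    # Placeholder: a real agent would ground the question in the image
    # and condition on the dialog history (e.g., to resolve pronouns
    # like "it" via coreference with earlier turns).
    if not example.history:
        return "(no history; answer from image alone)"
    return f"(answer conditioned on {len(example.history)} prior turns)"

example = VisualDialogExample(
    image_path="coco/000000123456.jpg",  # hypothetical path
    history=[VisualDialogTurn("What animal is in the picture?", "A dog."),
             VisualDialogTurn("What color is it?", "Brown.")],
    question="Is it sitting or standing?",
)
print(answer(example))
```

Note that the follow-up question alone ("Is it sitting or standing?") is ambiguous without the history; resolving "it" to "the dog" is exactly the kind of grounding the task evaluates.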

Latest papers with no code

FlexCap: Generating Rich, Localized, and Flexible Captions in Images

no code yet • 18 Mar 2024

The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions.

VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal Graphs

no code yet • 25 Oct 2023

We propose VD-GR, a novel visual dialog model that combines pre-trained language models (LMs) with graph neural networks (GNNs).

Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations

no code yet • 30 Aug 2023

We introduce Affective Visual Dialog, an emotion explanation and reasoning task as a testbed for research on understanding the formation of emotions in visually grounded conversations.

A survey on knowledge-enhanced multimodal learning

no code yet • 19 Nov 2022

Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.

Adversarial Robustness of Visual Dialog

no code yet • 6 Jul 2022

This study is the first to investigate the robustness of visually grounded dialog models against textual attacks.

UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog

no code yet • CVPR 2022

In this paper, we propose UTC, a contrastive learning-based framework that unifies and facilitates both discriminative and generative tasks in visual dialog with a single model.

Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning

no code yet • 15 Apr 2022

Visual Dialog is a challenging vision-language task since the visual dialog agent needs to answer a series of questions after reasoning over both the image content and dialog history.

Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog

no code yet • 10 Apr 2022

In our model, external knowledge is represented as sentence-level facts and graph-level facts, to suit the composite scenario of dialog history and image.

Modeling Coreference Relations in Visual Dialog

no code yet • EACL 2021

Visual dialog is a vision-language task where an agent needs to answer a series of questions grounded in an image based on the understanding of the dialog history and the image.

VU-BERT: A Unified framework for Visual Dialog

no code yet • 22 Feb 2022

The visual dialog task attempts to train an agent to answer multi-turn questions about an image, which requires a deep understanding of the interactions between the image and the dialog history.