AVQA

Audio-visual question answering aims to answer questions that involve both the audio and visual modalities of a given video. For example, given a video of a traffic intersection where the light turns red and the crossing stick drops, and the question "Why did the stick fall in the video?", answering requires combining the visual cue (the stick dropping) with the audio cue (a train whistle) to arrive at the answer "Here comes the train". To reason accurately and reach the correct answer, a model must extract cues and context from both modalities and uncover their underlying causal correlations.

Real-life scenarios contain more complex relationships between audio-visual objects and a wider variety of audio-visual daily activities. AVQA is an audio-visual question answering dataset for multimodal understanding of audio-visual objects and activities in real-life videos. AVQA provides diverse sets of questions designed to require both audio and visual information, covering various relationships between objects and within activities.
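To make the task concrete, a typical AVQA-style question can be modeled as a video identifier, a question, a set of candidate answers, and the index of the correct one. The sketch below shows a minimal, hypothetical loader for such records; the field names (`video_id`, `question`, `options`, `answer`) are assumptions for illustration, not the dataset's official schema.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class AVQASample:
    """One multiple-choice question record (field names are assumed)."""
    video_id: str       # identifier of the source video clip
    question: str       # natural-language question about audio and video
    options: List[str]  # candidate answers
    answer: int         # index of the correct option

def load_samples(json_text: str) -> List[AVQASample]:
    """Parse a JSON list of question records into AVQASample objects."""
    return [AVQASample(**record) for record in json.loads(json_text)]

# A record mirroring the crossing-stick example above.
example = json.dumps([{
    "video_id": "clip_0001",
    "question": "Why did the stick fall in the video?",
    "options": ["It broke", "Here comes the train", "Strong wind"],
    "answer": 1,
}])

samples = load_samples(example)
print(samples[0].options[samples[0].answer])  # -> Here comes the train
```

An actual loader would additionally resolve `video_id` to audio and frame streams; this sketch only covers the question annotations.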

License


  • Unknown
