AVQA

Audio-visual question answering aims to answer questions that involve both the audio and visual modalities of a given video. For example, given a video of a traffic intersection where the light turns red and the crossing stick drops, and the question "Why did the stick fall in the video?", answering requires combining the visual cue (the stick dropping) with the audio cue (a train whistle) to arrive at the answer "Here comes the train". To reason accurately and reach the correct answer, a model must extract cues and context from both modalities and uncover their underlying causal correlations.

Real-life scenarios contain more complex relationships between audio-visual objects and a wider variety of audio-visual daily activities. AVQA is an audio-visual question answering dataset for multimodal understanding of audio-visual objects and activities in real-life videos. AVQA provides diverse sets of questions designed to require both audio and visual information, covering various relationships between objects and within activities.
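To make the task concrete, a typical AVQA-style question can be modeled as a video identifier, a question, a set of candidate answers, and the index of the correct one. The sketch below shows a minimal, hypothetical loader for such records; the field names (`video_id`, `question`, `options`, `answer`) are assumptions for illustration, not the dataset's official schema.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class AVQASample:
    """One multiple-choice question record (field names are assumed)."""
    video_id: str       # identifier of the source video clip
    question: str       # natural-language question about audio and video
    options: List[str]  # candidate answers
    answer: int         # index of the correct option

def load_samples(json_text: str) -> List[AVQASample]:
    """Parse a JSON list of question records into AVQASample objects."""
    return [AVQASample(**record) for record in json.loads(json_text)]

# A record mirroring the crossing-stick example above.
example = json.dumps([{
    "video_id": "clip_0001",
    "question": "Why did the stick fall in the video?",
    "options": ["It broke", "Here comes the train", "Strong wind"],
    "answer": 1,
}])

samples = load_samples(example)
print(samples[0].options[samples[0].answer])  # -> Here comes the train
```

An actual loader would additionally resolve `video_id` to audio and frame streams; this sketch only covers the question annotations.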

License


  • Unknown
