Video Question Answering
67 papers with code • 11 benchmarks • 24 datasets
Video Question Answering (VideoQA) is the task of answering natural language questions about a given video: the model receives a video and a question and must produce an accurate answer grounded in the video's content.
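To make the typical pipeline concrete (frame encoding, question encoding, fusion, answer classification), here is a minimal PyTorch sketch. All module choices, feature dimensions, and the fixed answer vocabulary are illustrative assumptions, not any specific published architecture.

```python
import torch
import torch.nn as nn

class MinimalVideoQA(nn.Module):
    """Illustrative VideoQA model: encode frames and question, fuse, classify answer.
    A toy sketch under assumed dimensions, not a published architecture."""
    def __init__(self, vocab_size=10000, num_answers=1000, dim=256):
        super().__init__()
        # Per-frame visual features (e.g. CNN outputs) are assumed precomputed;
        # a GRU aggregates them over time into one video representation.
        self.video_rnn = nn.GRU(input_size=2048, hidden_size=dim, batch_first=True)
        # Question tokens -> embeddings -> GRU sentence representation.
        self.embed = nn.Embedding(vocab_size, dim)
        self.question_rnn = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)
        # Simple concatenation fusion + classification over a fixed answer vocabulary.
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, frame_feats, question_tokens):
        # frame_feats: (batch, num_frames, 2048); question_tokens: (batch, seq_len)
        _, v = self.video_rnn(frame_feats)                      # (1, batch, dim)
        _, q = self.question_rnn(self.embed(question_tokens))   # (1, batch, dim)
        fused = torch.cat([v.squeeze(0), q.squeeze(0)], dim=-1)
        return self.classifier(fused)                           # answer logits

model = MinimalVideoQA()
logits = model(torch.randn(2, 16, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 1000])
```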
We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.
We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequence data (e.g., a video clip and a language sentence).
We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering.
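As a rough illustration of scoring video-sentence pairs, the sketch below pools each sequence and applies a bilinear similarity in a shared space; it is a simplified stand-in under assumed feature dimensions, not the actual JSFusion joint sequence fusion mechanism.

```python
import torch
import torch.nn as nn

class PairwiseSimilarity(nn.Module):
    """Toy cross-modal scorer: mean-pool each sequence, project to a shared
    space, apply a bilinear form. A simplified stand-in, not JSFusion itself."""
    def __init__(self, video_dim=2048, text_dim=300, dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)
        self.text_proj = nn.Linear(text_dim, dim)
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, video_seq, text_seq):
        # video_seq: (batch, num_frames, video_dim); text_seq: (batch, seq_len, text_dim)
        v = self.video_proj(video_seq.mean(dim=1))   # (batch, dim)
        t = self.text_proj(text_seq.mean(dim=1))     # (batch, dim)
        return self.bilinear(v, t).squeeze(-1)       # one similarity score per pair

scorer = PairwiseSimilarity()
scores = scorer(torch.randn(4, 16, 2048), torch.randn(4, 10, 300))
print(scores.shape)  # torch.Size([4])
```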
SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events
In this paper, we create a novel dataset, SUTD-TrafficQA (Traffic Question Answering), which takes the form of video QA based on 10,080 collected in-the-wild videos and 62,535 annotated QA pairs, for benchmarking models' cognitive capability for causal inference and event understanding in complex traffic scenarios.