Video Question Answering
39 papers with code • 6 benchmarks • 17 datasets
Experiments on text-to-video retrieval and video question answering across six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos. This suggests that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, illustrating the proverbial less-is-more principle.
Ranked #2 on Visual Question Answering on MSRVTT-QA (using extra training data)
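The sparse-sampling idea can be illustrated with a minimal sketch. `sample_sparse_clips` is a hypothetical helper, not ClipBERT's actual sampler: it splits a video's frame indices into equal segments and draws one short clip per segment, so a handful of clips still covers the whole video.

```python
import random

def sample_sparse_clips(num_frames, num_clips=4, clip_len=8, seed=0):
    """Sample a few short clips of frame indices from a video.

    Hypothetical sketch of sparse sampling: instead of densely
    extracting features from all `num_frames` frames, pick
    `num_clips` clips of `clip_len` consecutive frame indices,
    one per equal-length segment of the video.
    """
    rng = random.Random(seed)
    seg_len = num_frames // num_clips
    clips = []
    for i in range(num_clips):
        # Choose a clip start inside segment i, clamped so the
        # clip does not run past the segment boundary.
        start_max = max(i * seg_len, (i + 1) * seg_len - clip_len)
        start = rng.randint(i * seg_len, start_max)
        clips.append(list(range(start, start + clip_len)))
    return clips
```

A model would then encode only these few clips per training step rather than the full frame sequence.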
We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequence data (e.g., a video clip and a language sentence).
Ranked #9 on Video Retrieval on MSR-VTT
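As a point of reference for what "measuring similarity between multimodal sequences" means, here is a toy baseline that mean-pools precomputed per-frame and per-token embeddings and scores the pair by cosine similarity. This is a simple stand-in under assumed inputs, not JSFusion's actual joint fusion mechanism.

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_pool(seq):
    # Average a sequence of embedding vectors into one vector.
    dim = len(seq[0])
    return [sum(step[d] for step in seq) / len(seq) for d in range(dim)]

def sequence_similarity(video_feats, text_feats):
    """Score a (video, sentence) pair by cosine similarity of
    mean-pooled embeddings -- a hypothetical baseline, assuming
    frame and token embeddings already live in a shared space."""
    return cosine_similarity(mean_pool(video_feats), mean_pool(text_feats))
```

Learned fusion models replace the mean-pooling step with interactions between the two sequences, which is where methods like JSFusion differ from this baseline.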
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks.
Ranked #3 on Video Question Answering on SUTD-TrafficQA
Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts.
Ranked #2 on Video Question Answering on SUTD-TrafficQA
We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.
Ranked #3 on Video Question Answering on TVQA
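The moment-retrieval half of this task can be sketched as a sliding-window maximum over per-frame relevance scores. The function name and the scoring scheme here are assumptions for illustration, not the paper's method; the scores would come from some video-question model.

```python
def best_moment(frame_scores, window=5):
    """Return the (start, end) frame span of the highest-scoring
    contiguous window, given hypothetical per-frame relevance
    scores for a question -- a toy stand-in for moment retrieval.
    """
    best_start = 0
    best_sum = sum(frame_scores[:window])
    cur = best_sum
    for s in range(1, len(frame_scores) - window + 1):
        # Slide the window one frame: add the entering frame,
        # drop the leaving frame.
        cur += frame_scores[s + window - 1] - frame_scores[s - 1]
        if cur > best_sum:
            best_sum, best_start = cur, s
    return best_start, best_start + window
```

A full system would additionally ground the referenced people and objects inside the retrieved span before answering.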