TUMTraffic-VideoQA is a novel dataset designed to understand spatiotemporal video in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatiotemporal object expressions, TUMTraffic-VideoQA unifies three essential tasks—multiple-choice video question answering, referred object captioning, and spatiotemporal object grounding—within a cohesive evaluation framework.
Paper | Code | Results | Date | Stars |
---|