Specifically, TL;DR can compress mainstream VLP datasets at a high ratio, e.g., reducing the well-cleaned CC3M dataset from 2.82M to 0.67M ($\sim$24\%) and the noisy YFCC15M from 15M to 2.5M ($\sim$16.7\%).
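A quick sanity check of the quoted retention ratios (a minimal sketch using only the sample counts stated above):

```python
# Verify the TL;DR compression ratios reported in the text.
cc3m_kept = 0.67 / 2.82   # CC3M: 2.82M -> 0.67M samples
yfcc_kept = 2.5 / 15.0    # YFCC15M: 15M -> 2.5M samples

print(f"CC3M retained:    {cc3m_kept:.1%}")  # ~23.8%, matching the ~24% quoted
print(f"YFCC15M retained: {yfcc_kept:.1%}")  # ~16.7%
```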
However, continual learning (CL) on VQA involves more than the expansion of label sets (new answer sets).
In this paper, we introduce a new dataset called Kinetic-GEB+.
In this paper, we define a new task called Affordance-centric Question-driven Task Completion, where the AI assistant should learn from instructional videos to provide step-by-step help in the user's view.
In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR).
This paper presents a novel task together with a new benchmark for detecting generic, taxonomy-free event boundaries that segment a whole video into chunks.