The goal of this dataset is to probe whether video-language models understand simple temporal relations such as "before" and "after". It is intended for evaluation only, not for training.
Contents:
1. The dataset has synthetic videos, each of which consists of a pair of shapes appearing gradually, one after the other. For example, the video for the caption "a red circle appears after a yellow circle" first shows a yellow circle appear and then a red circle appear. The model has to pick the right caption over a distractor caption, "a yellow circle appears after a red circle". Note that the distractor has the same set of words but in a different order, a design motivated by the Winograd schema.
2. The dataset also has a control set in which each video contains only a single event, e.g., "a red circle appears". This control task ensures that these videos are not out-of-distribution for a given video model.

A time-aware model should perform perfectly on both sets. A space-aware model that is not time-aware should perform poorly on the temporal task while still performing perfectly on the control task.
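The caption-versus-distractor comparison described above can be sketched as a binary-choice accuracy. This is a minimal illustration, not the dataset's official evaluation code: the names `swap_caption`, `binary_accuracy`, `time_aware_score`, and `order_blind_score` are hypothetical, a video is stood in for by the list of shapes in their order of appearance, and a tie between the two captions is assumed to count as an error.

```python
from typing import Callable, List, Sequence, Tuple

Video = List[str]  # assumption: a video is stood in for by shapes in order of appearance


def swap_caption(first: str, second: str, relation: str = "after") -> Tuple[str, str]:
    """Return (caption, distractor); the distractor swaps the two events."""
    caption = f"{first} appears {relation} {second}"
    distractor = f"{second} appears {relation} {first}"
    return caption, distractor


def binary_accuracy(
    examples: Sequence[Tuple[Video, str, str]],
    score: Callable[[Video, str], float],
) -> float:
    """Fraction of examples where the true caption strictly outscores the distractor."""
    correct = sum(
        1 for video, caption, distractor in examples
        if score(video, caption) > score(video, distractor)
    )
    return correct / len(examples)


def time_aware_score(video: Video, caption: str) -> float:
    """Toy scorer that checks the order of events stated in an 'after' caption."""
    later, earlier = caption.split(" appears after ")
    return 1.0 if video.index(earlier) < video.index(later) else 0.0


def order_blind_score(video: Video, caption: str) -> float:
    """Toy scorer that only counts shared words, ignoring temporal order."""
    video_words = set(" ".join(video).split())
    return float(len(video_words & set(caption.split())))


caption, distractor = swap_caption("a red circle", "a yellow circle")
# The yellow circle appears first, then the red circle.
examples = [(["a yellow circle", "a red circle"], caption, distractor)]

print(binary_accuracy(examples, time_aware_score))
print(binary_accuracy(examples, order_blind_score))
```

Because the caption and its distractor share exactly the same words, the order-blind scorer assigns them identical scores and cannot separate them, which is the failure mode the temporal task is designed to expose.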