Video Model Blocks

Support-set Based Cross-Supervision

Introduced by Ding et al. in Support-Set Based Cross-Supervision for Video Grounding

Sscs, or Support-set Based Cross-Supervision, is a module for video grounding which consists of two main components: a discriminative contrastive objective and a generative caption objective. The contrastive objective aims to learn effective representations by contrastive learning, while the caption objective trains a powerful video encoder supervised by texts. Because some visual entities co-exist in both the ground-truth and background intervals, i.e., the mutual exclusion of entities is violated, naive contrastive learning is unsuitable for video grounding. This problem is addressed by boosting the cross-supervision with the support-set concept, which collects visual information from the whole video and eliminates the mutual exclusion of entities.
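The discriminative objective above can be illustrated with a standard symmetric InfoNCE-style contrastive loss between video and text embeddings. This is a minimal sketch of that general technique, not the paper's exact formulation; the function name and temperature value are assumptions for illustration.

```python
import numpy as np

def cross_supervision_contrastive(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of video-text pairs.

    Matched pairs (the diagonal of the similarity matrix) are pulled
    together; mismatched pairs in the batch are pushed apart.
    video_emb, text_emb: arrays of shape (B, D).
    """
    # L2-normalize so the dot product is cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B) similarity matrix

    def ce_diag(l):
        # cross-entropy with the diagonal entries as positives
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # symmetrize: video-to-text and text-to-video directions
    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))
```

A matched batch (each video paired with its own text) yields a lower loss than a shuffled one, which is exactly the pull-close/push-away behavior described above.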

Specifically, in the Figure to the right, two video-text pairs {$V_{i}, L_{i}$}, {$V_{j}, L_{j}$} in the batch are presented for clarity. After feeding them into a video and text encoder, the clip-level and sentence-level embeddings ({$X_{i}, Y_{i}$} and {$X_{j}, Y_{j}$}) in a shared space are acquired. Based on the support-set module, the weighted averages of $X_{i}$ and $X_{j}$ are computed to obtain $\bar{X}_{i}$ and $\bar{X}_{j}$, respectively. Finally, the contrastive and caption objectives are combined to pull close the representations of the clips and text from the same samples and push away those from other pairs.
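The support-set weighted average $\bar{X}$ can be sketched as an attention-weighted pooling of the clip-level embeddings, with weights derived from similarity to the sentence embedding. This is a hypothetical reading of the pooling step for illustration; the exact weighting scheme in the paper may differ.

```python
import numpy as np

def support_set_pooling(clip_emb, text_emb):
    """Weighted average of clip embeddings over the whole video.

    clip_emb: (T, D) clip-level embeddings X for one video.
    text_emb: (D,) sentence-level embedding Y.
    Returns the (D,) support-set representation X-bar.
    """
    sims = clip_emb @ text_emb       # (T,) clip-to-sentence similarities
    w = np.exp(sims - sims.max())    # numerically stable softmax
    w = w / w.sum()                  # attention weights over clips
    return w @ clip_emb              # weighted average of all clips
```

Because the weights span every clip in the video, the pooled representation aggregates visual entities from both the ground-truth and background intervals, which is what lets the module sidestep the mutual-exclusion problem of naive contrastive learning.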

Source: Support-Set Based Cross-Supervision for Video Grounding

Tasks


Task Papers Share
Video Grounding 1 100.00%

