ClipBERT is a framework for end-to-end learning on video-and-language tasks. It employs sparse sampling: only a single or a few sparsely sampled short clips from a video are used at each training step. Two aspects distinguish ClipBERT from previous work.

First, in contrast to densely extracting video features (the approach adopted by most existing methods), ClipBERT sparsely samples only one or a few short clips from the full-length video at each training step. The hypothesis is that visual features from sparse clips already capture the key visual and semantic information in the video, as consecutive clips usually contain similar semantics from a continuous scene. Thus, a handful of clips is sufficient for training, instead of the full video. At inference, predictions from multiple densely sampled clips are then aggregated to obtain the final video-level prediction, which is less computationally demanding.
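The sampling scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`sample_clips`, `uniform_clips`, `aggregate_mean`), the evenly spaced choice of inference clips, and the use of a plain mean to aggregate clip-level scores are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple

Clip = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def sample_clips(num_frames: int, clip_len: int, num_clips: int,
                 rng: random.Random) -> List[Clip]:
    """Training: pick a few random short clips instead of the whole video."""
    if clip_len > num_frames:
        raise ValueError("clip_len must not exceed num_frames")
    max_start = num_frames - clip_len
    starts = sorted(rng.randint(0, max_start) for _ in range(num_clips))
    return [(s, s + clip_len) for s in starts]


def uniform_clips(num_frames: int, clip_len: int, num_clips: int) -> List[Clip]:
    """Inference: evenly spaced clips covering the video (assumed scheme)."""
    if clip_len > num_frames:
        raise ValueError("clip_len must not exceed num_frames")
    max_start = num_frames - clip_len
    if num_clips == 1:
        starts = [max_start // 2]
    else:
        starts = [round(i * max_start / (num_clips - 1)) for i in range(num_clips)]
    return [(s, s + clip_len) for s in starts]


def aggregate_mean(clip_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-clip scores into one video-level score vector (assumed)."""
    if not clip_scores:
        raise ValueError("need at least one clip prediction")
    n = len(clip_scores)
    return [sum(column) / n for column in zip(*clip_scores)]


# Training step: two random clips from a 100-frame video.
train_clips = sample_clips(100, clip_len=10, num_clips=2, rng=random.Random(0))

# Inference: four evenly spaced clips, with made-up per-clip class scores.
test_clips = uniform_clips(100, clip_len=10, num_clips=4)
video_scores = aggregate_mean([[1.0, 3.0], [3.0, 5.0]])
```

In a real pipeline each clip would be passed through the model to produce the per-clip scores; here they are hard-coded so that only the sampling and aggregation logic is shown.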

The second differentiating aspect concerns the initialization of model weights (i.e., transfer through pre-training). The authors use 2D architectures (e.g., ResNet-50) instead of 3D features as the visual backbone for video encoding, allowing them to harness the power of image-text pretraining for video-text understanding along with the advantages of low memory cost and runtime efficiency.
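One way to picture how a 2D image backbone can encode a video clip is to run it on each frame independently and then fuse the per-frame features over time. The sketch below assumes mean pooling for that fusion step, which the summary above does not specify; `encode_clip` and `toy_encoder` are hypothetical names, and the toy encoder merely stands in for a pretrained network such as ResNet-50.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # stand-in for an image (hypothetical flat pixel list)
Feature = List[float]


def encode_clip(frames: Sequence[Frame],
                image_encoder: Callable[[Frame], Feature]) -> Feature:
    """Encode each frame with a 2D (image) encoder, then average over time."""
    if not frames:
        raise ValueError("a clip needs at least one frame")
    per_frame = [image_encoder(frame) for frame in frames]
    num_frames = len(per_frame)
    # Temporal mean pooling (an assumption): one feature vector per clip.
    return [sum(dim) / num_frames for dim in zip(*per_frame)]


def toy_encoder(frame: Frame) -> Feature:
    """Toy 2-dimensional 'feature extractor' used only for illustration."""
    return [sum(frame), max(frame)]


clip_feature = encode_clip([[1.0, 2.0], [3.0, 4.0]], toy_encoder)
```

Because the encoder only ever sees single frames, its weights can come from image-text pre-training, which is the transfer benefit the paragraph describes.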

Source: Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling



Task                        Papers   Share
Question Answering          1        25.00%
Video Question Answering    1        25.00%
Video Retrieval             1        25.00%
Visual Question Answering   1        25.00%

