During the pre-training stage, we establish correspondences between images and point clouds from readily available RGB-D data and use contrastive learning to align the image and point cloud representations.
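As an illustration of such cross-modal alignment (a minimal sketch, not the paper's exact implementation), the following assumes paired image and point-cloud embeddings from the same RGB-D frames and applies a symmetric InfoNCE-style contrastive loss; the function name and temperature value are placeholders:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_feat, pc_feat, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired image and point-cloud embeddings.

    img_feat, pc_feat: (B, D) features for B image/point-cloud pairs, where
    the pairing comes from the RGB-D correspondence.
    """
    img = F.normalize(img_feat, dim=-1)
    pc = F.normalize(pc_feat, dim=-1)
    logits = img @ pc.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Matched pairs lie on the diagonal; all other in-batch pairs act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```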
In this paper, we present Deep Incubation, a novel approach that enables the efficient and effective training of large models by dividing them into smaller sub-modules that can be trained separately and assembled seamlessly.
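A highly simplified sketch of the modular-training idea follows. It is not Deep Incubation's actual procedure (which trains each sub-module with the help of a lightweight meta model); here, hypothetical cheap stand-in layers play that contextual role, and the training loop and objective are placeholders:

```python
import torch
import torch.nn as nn

# Hypothetical target model split into three sub-modules (stages).
def make_submodule(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU())

dims = [128, 256, 256, 128]
submodules = [make_submodule(dims[i], dims[i + 1]) for i in range(3)]

# Train each sub-module separately inside a tiny scaffold that stands in
# for the rest of the network (a loose stand-in for the meta-model idea).
for i, module in enumerate(submodules):
    scaffold = nn.Sequential(
        nn.Linear(dims[0], dims[i]),       # cheap stand-in for earlier stages
        module,                            # the sub-module being trained
        nn.Linear(dims[i + 1], dims[-1]),  # cheap stand-in for later stages
    )
    opt = torch.optim.Adam(module.parameters(), lr=1e-3)
    for _ in range(100):                   # placeholder training loop
        x = torch.randn(32, dims[0])
        loss = scaffold(x).pow(2).mean()   # placeholder objective
        opt.zero_grad(); loss.backward(); opt.step()

full_model = nn.Sequential(*submodules)    # assemble the trained sub-modules
```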
However, as pre-trained models scale up, fully fine-tuning them on text-video retrieval datasets carries a high risk of overfitting.
Our method leverages an off-the-shelf object detector to identify visual objects in unlabeled images, and then obtains language queries for these objects in an unsupervised fashion via a pseudo-query generation module.
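To make the pipeline concrete, here is a minimal sketch of the pseudo-query idea. The actual module is more involved; the template and function below are hypothetical, and only illustrate turning detector outputs into annotation-free (query, box) training pairs:

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]

def generate_pseudo_queries(detections: List[Tuple[str, Box]]) -> List[Tuple[str, Box]]:
    """detections: (class_name, bbox) pairs from an off-the-shelf detector.

    Returns (pseudo_query, bbox) pairs usable as grounding supervision.
    """
    queries = []
    for name, box in detections:
        # Hypothetical template; real systems may add attributes or relations.
        queries.append((f"the {name} in the image", box))
    return queries

# Example: detector output for one unlabeled image.
print(generate_pseudo_queries([("dog", (10, 20, 120, 180)),
                               ("frisbee", (90, 40, 150, 90))]))
```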
Spatial redundancy widely exists in visual recognition tasks, i.e., discriminative features in an image or video frame usually correspond to only a subset of pixels, while the remaining regions are irrelevant to the task at hand.
Recent works have shown that the computational efficiency of video recognition can be significantly improved by reducing this spatial redundancy.
In this paper, we explore the spatial redundancy in video recognition with the aim of improving computational efficiency.
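As a toy illustration of exploiting spatial redundancy (one possible scheme, not necessarily the paper's method), a cheap low-resolution "glance" can locate the most informative region so that the expensive network only processes that patch; `glance_net` and `focus_net` are hypothetical models:

```python
import torch
import torch.nn.functional as F

def recognize_frame(frame, glance_net, focus_net, patch=96):
    """frame: (1, 3, H, W). Glance at low resolution, then focus on one patch."""
    low = F.interpolate(frame, size=(64, 64), mode="bilinear", align_corners=False)
    saliency = glance_net(low)                 # assumed (1, 1, 64, 64) saliency map
    idx = saliency.flatten().argmax()
    cy, cx = idx // 64, idx % 64
    H, W = frame.shape[-2:]
    cy, cx = int(cy * H / 64), int(cx * W / 64)  # map back to full resolution
    top = max(0, min(H - patch, cy - patch // 2))
    left = max(0, min(W - patch, cx - patch // 2))
    crop = frame[..., top:top + patch, left:left + patch]
    return focus_net(crop)                     # classify from the salient patch only
```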
Reusing features in deep networks through dense connectivity is an effective way to achieve high computational efficiency.
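A minimal DenseNet-style dense block illustrates this mechanism: each layer takes the concatenation of all preceding feature maps as input, so earlier features are reused rather than recomputed. The layer composition below (BN-ReLU-Conv) follows the standard DenseNet design; hyperparameter values are illustrative:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all preceding feature maps."""

    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse all earlier features
            features.append(out)
        return torch.cat(features, dim=1)

# Each layer adds `growth_rate` channels on top of the reused features.
block = DenseBlock(in_channels=64, growth_rate=32, num_layers=4)
y = block(torch.randn(1, 64, 56, 56))  # -> (1, 64 + 4 * 32, 56, 56)
```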