Recent research has revealed that reducing either temporal or spatial redundancy is an effective route to efficient video recognition, e.g., allocating the majority of computation to a task-relevant subset of frames or to the most informative regions of each frame.
To alleviate these issues, we propose to simultaneously conduct feature alignment in two individual spaces, each focusing on a different domain, and to equip each space with a domain-oriented classifier tailored specifically to that domain.
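As an illustration only, the two-space design could be sketched in PyTorch roughly as follows; the module names, the linear projection heads, and the simple mean-matching penalty standing in for the paper's actual alignment loss are all assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class DualSpaceAdapter(nn.Module):
    """Two projection heads map backbone features into two individual
    spaces; each space has its own domain-oriented classifier."""
    def __init__(self, feat_dim, proj_dim, num_classes):
        super().__init__()
        self.proj_src = nn.Linear(feat_dim, proj_dim)   # source-oriented space
        self.proj_tgt = nn.Linear(feat_dim, proj_dim)   # target-oriented space
        self.cls_src = nn.Linear(proj_dim, num_classes)
        self.cls_tgt = nn.Linear(proj_dim, num_classes)

    def forward(self, f):
        zs, zt = self.proj_src(f), self.proj_tgt(f)
        return (zs, self.cls_src(zs)), (zt, self.cls_tgt(zt))

def mean_match(x, y):
    # Crude stand-in for a domain-alignment loss: match domain means.
    return (x.mean(0) - y.mean(0)).pow(2).sum()
```

In this sketch, feature alignment would be applied separately inside each projected space (e.g., `mean_match` between source and target batches), while each classifier is supervised primarily by the domain its space is oriented towards.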
Spatial redundancy widely exists in visual recognition tasks, i.e., discriminative features in an image or video frame usually correspond to only a subset of pixels, while the remaining regions are irrelevant to the task at hand.
Recent works have shown that the computational efficiency of video recognition can be significantly improved by reducing spatial redundancy.
Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain.
Inspired by this phenomenon, we propose a Dynamic Transformer that automatically configures an appropriate number of tokens for each input image.
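A minimal sketch of the inference logic this implies, assuming a cascade of ViTs ordered from coarse to fine token grids with a confidence-based early exit; the reuse of features across stages in the actual model is omitted, and all names here are hypothetical:

```python
import torch

@torch.no_grad()
def dynamic_vit_inference(image, vits, threshold=0.9):
    """Try ViTs with increasingly many tokens (e.g., 7x7 -> 14x14 grids)
    and stop as soon as the prediction is confident enough.
    Assumes a single image (batch size 1)."""
    for vit in vits:                      # ordered from coarse to fine
        probs = vit(image).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:      # confident enough: exit early
            return pred
    return pred                           # fall back to the finest stage
```

"Easy" images thus terminate at a cheap coarse-token stage, while only "hard" images pay for the fine-grained tokenization.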
In this paper, we explore the spatial redundancy in video recognition with the aim of improving computational efficiency.
Reusing features in deep networks through dense connectivity is an effective way to achieve high computational efficiency.
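As a concrete instance of dense connectivity, here is a DenseNet-style block in PyTorch, in which every layer consumes the concatenation of all preceding feature maps instead of recomputing them (a minimal sketch; the hyper-parameters are placeholders):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees the concatenation of all preceding feature maps,
    so early features are reused rather than recomputed."""
    def __init__(self, in_ch, growth, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth, growth, 3,
                          padding=1, bias=False),
            )
            for i in range(num_layers)
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)
```

Because each layer only adds `growth` new channels on top of reused ones, the block stays narrow and computationally cheap relative to its effective depth.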
To remedy this, we propose a Transferable Semantic Augmentation (TSA) approach that enhances classifier adaptation by implicitly generating source features towards target semantics.
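To make the idea concrete, an explicit-sampling version of such augmentation can be sketched as below; the diagonal-covariance simplification, the function name, and the single-sample form are assumptions, whereas the actual approach integrates over such perturbations implicitly in closed form:

```python
import torch

def augment_towards_target(src_feats, src_labels, delta_mu, tgt_cov_diag, lam=0.5):
    """Perturb each source feature with a sample from
    N(lam * delta_mu_c, lam * diag(Sigma_c)), where delta_mu_c is the
    class-wise target-minus-source feature mean difference and Sigma_c a
    (diagonal) class-conditional target covariance, pushing source
    features towards target semantics."""
    std = (lam * tgt_cov_diag[src_labels]).clamp_min(0).sqrt()
    noise = torch.randn_like(src_feats) * std
    return src_feats + lam * delta_mu[src_labels] + noise
```

Here `delta_mu` has shape `[num_classes, feat_dim]` and `tgt_cov_diag` likewise stores per-class variances; `lam` controls the augmentation strength.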
Real-world training data usually exhibits a long-tailed distribution, where a few majority classes have significantly more samples than the remaining minority classes.
Due to the need to store the intermediate activations for back-propagation, end-to-end (E2E) training of deep networks usually suffers from a high GPU memory footprint.
As the InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm.
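Schematically, such a surrogate can be realized by attaching to each locally supervised module a reconstruction head (to retain input information) and an auxiliary classification head (to retain label information); the sketch below assumes this two-head form with hypothetical modules and loss weights, and is not the paper's exact objective:

```python
import torch.nn.functional as F

def local_module_step(module, decoder, aux_head, x_in, x_img, y,
                      lambda_rec=1.0, lambda_cls=1.0):
    """One locally supervised update: the module is trained jointly with a
    reconstruction head (roughly preserving input information) and an
    auxiliary classifier (roughly preserving label information). The output
    is detached before being passed on, so no gradients cross module
    boundaries and earlier activations need not be kept."""
    h = module(x_in)
    loss = (lambda_rec * F.mse_loss(decoder(h), x_img)      # input info
            + lambda_cls * F.cross_entropy(aux_head(h), y)) # label info
    loss.backward()
    return h.detach()
```

Because each module back-propagates only within itself, activations of other modules can be freed immediately, which is what reduces the memory footprint relative to E2E training.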
The accuracy of deep convolutional neural networks (CNNs) generally improves when fueled with high-resolution images.
The proposed method is inspired by the intriguing property that deep networks are effective in learning linearized features, i.e., certain directions in the deep feature space correspond to meaningful semantic transformations, e.g., changing the background or view angle of an object.
In this paper, we propose a novel meta-learning based SSL algorithm (Meta-Semi) that requires tuning only one additional hyper-parameter compared with a standard supervised deep learning algorithm, yet achieves competitive performance under various SSL conditions.
Our work is motivated by the intriguing property that deep networks are surprisingly good at linearizing features, such that certain directions in the deep feature space correspond to meaningful semantic transformations, e.g., adding sunglasses or changing backgrounds.
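An explicit-sampling caricature of semantic augmentation built on this property is sketched below; the diagonal covariance and the function name are assumptions, and the actual algorithm avoids sampling altogether by minimizing a closed-form upper bound on the expected loss:

```python
import torch

def semantic_augment(features, labels, class_cov_diag, lam=0.5):
    """For each deep feature, sample a direction from N(0, lam * Sigma_c),
    where Sigma_c is the class-conditional feature covariance (diagonal
    here for simplicity, shape [num_classes, feat_dim]), and translate the
    feature along it. Directions with large class-conditional variance
    tend to correspond to label-preserving semantic changes."""
    std = (lam * class_cov_diag[labels]).clamp_min(0).sqrt()
    return features + torch.randn_like(features) * std
```

Training the classifier on such translated features approximates augmenting the dataset with semantically transformed samples, without ever generating images.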