This work aims to improve the efficiency of vision transformers (ViT).
Though the state-of-the architectures for semantic segmentation, such as HRNet, demonstrate impressive accuracy, the complexity arising from their salient design choices hinders a range of model acceleration tools, and further they make use of operations that are inefficient on current hardware.
In this paper, we propose SALISA, a novel non-uniform SALiency-based Input SAmpling technique for video object detection that allows for heavy down-sampling of unimportant background regions while preserving the fine-grained details of a high-resolution image.
By extensive experiments on a wide range of architectures, including the most efficient ones, we demonstrate that delta distillation sets a new state of the art in terms of accuracy vs. efficiency trade-off for semantic segmentation and object detection in videos.
To the best of our knowledge, our proposals are the first solutions that integrate ROI-based capabilities into neural video compression models.
In this paper, we propose a conditional early exiting framework for efficient video recognition.
We reformulate standard convolution to be efficiently computed on residual frames: each layer is coupled with a binary gate deciding whether a residual is important to the model prediction,~\eg foreground regions, or it can be safely skipped, e. g. background regions.
We employ a model that consists of a 3D autoencoder with a discrete latent space and an autoregressive prior used for entropy coding.
In this paper, we introduce an approach to stochastically combine the root of variations with previous pose information, which forces the model to take the noise into account.
In our proposed embedding, which we call VideoStory, the correlations between the terms are utilized to learn a more effective representation by optimizing a joint objective balancing descriptiveness and predictability. We show how learning the VideoStory using a multimodal predictability loss, including appearance, motion and audio features, results in a better predictable representation.