In video transformers, the time dimension is often treated in the same way as the two spatial dimensions.
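As a rough illustration of this joint treatment, below is a minimal sketch of a tubelet embedding in PyTorch, where a single 3D convolution tokenizes the clip along time, height, and width alike. The class name, tubelet size, and embedding dimension are illustrative assumptions, not the design of any specific paper.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Treats time like space: one 3D convolution cuts the clip into
    (t, h, w) tubelets, each mapped to a single token.
    (Hypothetical sketch; sizes are illustrative.)"""

    def __init__(self, embed_dim=768, tubelet=(2, 16, 16), in_channels=3):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, video):  # video: (B, C, T, H, W)
        x = self.proj(video)                 # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D) tokens

clip = torch.randn(2, 3, 16, 224, 224)
tokens = TubeletEmbedding()(clip)
print(tokens.shape)  # torch.Size([2, 1568, 768])
```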
First, for space, we show that spatial augmentations such as cropping also work well for videos, but that previous implementations, owing to their high processing and memory cost, could not apply them at a scale sufficient to be effective.
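A minimal sketch of what such a spatial augmentation looks like for video, assuming PyTorch tensors of shape (T, C, H, W): one crop window is sampled and then applied to every frame, so the augmentation stays consistent along the time axis. The function name and crop size are hypothetical.

```python
import torch

def random_crop_clip(clip, size=224):
    """Apply one random spatial crop to all frames of a clip.
    clip: (T, C, H, W) tensor; returns (T, C, size, size).
    (Hypothetical helper for illustration.)"""
    _, _, h, w = clip.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    return clip[:, :, top:top + size, left:left + size]

clip = torch.randn(16, 3, 256, 320)   # a 16-frame clip
print(random_crop_clip(clip).shape)   # torch.Size([16, 3, 224, 224])
```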
Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings.
In the image domain, excellent representations can be learned by inducing invariance to content-preserving transformations via noise contrastive learning.
The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs.
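A minimal sketch of this contrastive objective, assuming a symmetric InfoNCE loss over a batch of paired video and text embeddings in PyTorch; the function name and temperature value are illustrative assumptions rather than any paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings: the i-th video and
    i-th text are pulled together, all other pairings pushed apart.
    video_emb, text_emb: (B, D) tensors. (Illustrative sketch.)"""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature       # (B, B) similarity matrix
    targets = torch.arange(v.size(0))    # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

v = torch.randn(8, 256)
t = torch.randn(8, 256)
print(contrastive_loss(v, t).item())
```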
A large part of the current success of deep learning lies in the effectiveness of data -- more precisely: labelled data.
In particular, we achieve new state-of-the-art accuracies of 72.8% on HMDB-51 and 95.2% on UCF-101.
In this paper, we discuss some of the shortcomings of existing approaches to perturbation analysis and address them by introducing the concept of extremal perturbations, which are theoretically grounded and interpretable.