In the image domain, excellent representations can be learned by inducing invariance to content-preserving transformations via noise contrastive learning.
In particular, we achieve new state-of-the-art accuracies of 72. 8% on HMDB-51 and 95. 2% on UCF-101.
We present a new tree based approach to composing expressive image descriptions that makes use of naturally occuring web images with captions.