On the other hand, state-of-the-art pretraining is now achieved with unsupervised methods, meaning that labelled datasets such as ImageNet may not be necessary, or perhaps not even optimal, for model pretraining.
In this paper, we collect hateful and non-hateful memes from Pinterest to evaluate the out-of-sample performance of models pre-trained on the Facebook dataset.
In video transformers, the time dimension is often treated in the same way as the two spatial dimensions.
We tackle the problem of learning object detectors without supervision.
First, for space, we show that spatial augmentations such as cropping also work well for videos, but that previous implementations, owing to their high processing and memory cost, could not apply them at a scale sufficient for them to be effective.
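One common reading of a spatial augmentation for video is a random crop window sampled once per clip and applied identically to every frame, so the augmentation is temporally consistent. A minimal sketch under that assumption; the clip shape and crop size are illustrative:

```python
import numpy as np

def random_crop_clip(clip, size, rng):
    """Sample one crop window and apply it to every frame of a clip
    (T, H, W, C), keeping the augmentation consistent over time."""
    T, H, W, C = clip.shape
    y = int(rng.integers(0, H - size + 1))
    x = int(rng.integers(0, W - size + 1))
    return clip[:, y:y + size, x:x + size, :]

rng = np.random.default_rng(0)
clip = rng.normal(size=(16, 128, 128, 3))
crop = random_crop_clip(clip, 112, rng)
assert crop.shape == (16, 112, 112, 3)
```

Sampling the window once per clip (rather than per frame) is what keeps the positive pair semantically aligned across time.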
Privacy considerations and bias in datasets are quickly becoming high-priority issues that the computer vision community needs to face.
Using a template-based data collection pipeline, we collect 396K sentence completions made by GPT-2 and find: (i) The machine-predicted jobs are less diverse and more stereotypical for women than for men, especially for intersections; (ii) Intersectional interactions are highly relevant for occupational associations, which we quantify by fitting 262 logistic models; (iii) For most occupations, GPT-2 reflects the skewed gender and ethnicity distribution found in US Labor Bureau data, and even pulls the societally-skewed distribution towards gender parity in cases where its predictions deviate from real labor market observations.
In the image domain, excellent representations can be learned by inducing invariance to content-preserving transformations via noise contrastive learning.
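A minimal sketch of the noise contrastive objective referenced here, in its widely used InfoNCE form: two augmented views of each image are embedded, matching rows form positive pairs, and every other row in the batch serves as a negative. The batch size, embedding width, and temperature below are illustrative assumptions:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE loss between two batches of view embeddings (N, D).
    Row i of z1 is a positive pair with row i of z2; all other rows
    in z2 act as negatives."""
    # L2-normalise so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives lie on the diagonal
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_same = info_nce_loss(z, z)                       # aligned views: low loss
loss_rand = info_nce_loss(z, rng.normal(size=(8, 16)))  # unrelated views: higher
assert loss_same < loss_rand
```

Minimising this loss makes embeddings invariant to the content-preserving transformations used to generate the two views, while the in-batch negatives prevent representational collapse.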
A large part of the current success of deep learning lies in the effectiveness of data -- more precisely: labelled data.
In particular, we achieve new state-of-the-art accuracies of 72.8% on HMDB-51 and 95.2% on UCF-101.
We look critically at popular self-supervision techniques for learning deep convolutional neural networks without manual labels.