Self-Supervised Action Recognition
33 papers with code • 6 benchmarks • 4 datasets
We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics.
Our representations are learned using a contrastive loss: two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed apart.
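As a concrete illustration of this pull/push objective, here is a minimal PyTorch sketch of an InfoNCE-style contrastive loss over clip embeddings. This is not the authors' code; the batch size, embedding dimension, and temperature below are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def clip_infonce_loss(z1: torch.Tensor, z2: torch.Tensor,
                      temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of two augmented clips per video."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature  # (batch, batch) cosine similarities
    # Matching clips of the same video sit on the diagonal (positives);
    # all other entries are clips from different videos (negatives).
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

# Usage: the embeddings would come from a video encoder, e.g. a 3D CNN head.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = clip_infonce_loss(z1, z2)
```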
Pre-training video transformers on extra-large-scale datasets is generally required to achieve top performance on relatively small datasets.
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks.
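The masked feature modeling objective behind this finding can be sketched as regressing teacher features at masked token positions. The following is a hedged approximation, not the released MVD implementation; the tensor shapes and the layer-normalized target are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_feature_distillation_loss(student_pred: torch.Tensor,
                                     teacher_feat: torch.Tensor,
                                     mask: torch.Tensor) -> torch.Tensor:
    """student_pred, teacher_feat: (batch, tokens, dim);
    mask: (batch, tokens) bool, True where the token was hidden from the student."""
    # Layer-normalize teacher features so the regression target has a stable scale
    # (an assumed design choice, common in masked feature modeling).
    target = F.layer_norm(teacher_feat, teacher_feat.shape[-1:])
    per_token = ((student_pred - target) ** 2).mean(dim=-1)  # per-token MSE
    # Average the loss over masked tokens only.
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```

Swapping the teacher between a video model and an image model is what produces the temporal-versus-spatial trade-off described above.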
With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations.
Recent attempts learn a representation model by predicting the appearance content of the masked regions.
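A minimal sketch of this masked-appearance-prediction pretext task, under assumed shapes and a hypothetical patch layout (not any specific paper's code): sample a random patch mask, then reconstruct pixel content only at the masked positions.

```python
import torch
import torch.nn.functional as F

def sample_random_mask(batch: int, num_patches: int, mask_ratio: float = 0.9,
                       device: str = "cpu") -> torch.Tensor:
    """Per-video boolean mask; True marks space-time patches to hide and predict."""
    num_masked = int(num_patches * mask_ratio)
    order = torch.rand(batch, num_patches, device=device).argsort(dim=1)
    mask = torch.zeros(batch, num_patches, dtype=torch.bool, device=device)
    mask.scatter_(1, order[:, :num_masked], True)
    return mask

def masked_reconstruction_loss(pred: torch.Tensor, target: torch.Tensor,
                               mask: torch.Tensor) -> torch.Tensor:
    """pred, target: (batch, num_patches, patch_dim); loss over masked patches only."""
    return F.mse_loss(pred[mask], target[mask])
```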
A good data representation should capture relations between instances, i.e., semantic similarity and dissimilarity, which contrastive learning harms by treating all negatives as noise.
Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
We conduct extensive experiments with C3D to validate the effectiveness of our proposed approach.
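As a loose illustration of the statistics-prediction idea (the paper derives its targets from optical flow and appearance statistics; the frame-difference proxy and block grid below are simplifying assumptions), a cheap motion target can be computed per spatial block and regressed by the network:

```python
import torch
import torch.nn.functional as F

def motion_statistics_target(clip: torch.Tensor, grid: int = 4) -> torch.Tensor:
    """clip: (batch, channels, time, height, width).
    Returns (batch, grid*grid) mean temporal-difference magnitude per spatial block."""
    # Mean absolute frame difference, averaged over channels and time.
    diff = (clip[:, :, 1:] - clip[:, :, :-1]).abs().mean(dim=(1, 2))  # (b, h, w)
    blocks = F.adaptive_avg_pool2d(diff.unsqueeze(1), grid)           # (b, 1, grid, grid)
    return blocks.flatten(1)
```

Training would then minimize, e.g., `F.mse_loss(head(encoder(clip)), motion_statistics_target(clip))`, where `head` and `encoder` are hypothetical names for the prediction head and the C3D backbone.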