Self-Supervised Action Recognition

27 papers with code • 6 benchmarks • 4 datasets

Contrastive Multiview Coding

HobbitLong/CMC ECCV 2020

We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics.

Spatiotemporal Contrastive Video Representation Learning

tensorflow/models CVPR 2021

Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away.
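The pull-together, push-apart objective described above can be sketched as an InfoNCE-style loss. This is a minimal illustration, not the paper's implementation: the excerpt does not name the exact loss, so the function name `info_nce_loss`, the temperature of 0.1, the toy 2-D embeddings, and the choice to draw negatives only from the second set of clips are all assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(clips_a, clips_b, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    clips_a[i] and clips_b[i] are embeddings of two augmented clips from
    video i (the positive pair). For anchor clips_a[i], every clips_b[j]
    with j != i comes from a different video and acts as a negative.
    """
    a = [l2_normalize(v) for v in clips_a]
    b = [l2_normalize(v) for v in clips_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Temperature-scaled cosine similarity to every clip in the other set.
        logits = [
            sum(x * y for x, y in zip(anchor, other)) / temperature
            for other in b
        ]
        # Numerically stable log-sum-exp over positives and negatives.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the positive pair as the correct "class".
        total += log_denom - logits[i]
    return total / len(a)


# Toy embeddings (hypothetical): two videos, two augmented clips each.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(anchors, [[0.9, 0.1], [0.1, 0.9]])
swapped = info_nce_loss(anchors, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)
```

The loss is small when clips from the same video sit close together in the embedding space and grows when a clip is closer to a clip from a different video, which is the behavior the excerpt describes.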

Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework

BestJuly/Inter-intra-video-contrastive-learning 6 Aug 2020

With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations.

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

facebookresearch/SlowFast CVPR 2021

We present a large-scale study on unsupervised spatiotemporal representation learning from videos.

Unsupervised Representation Learning by Sorting Sequences

HsinYingLee/OPN ICCV 2017

We present an unsupervised representation learning approach using videos without semantic labels.

Video Representation Learning by Dense Predictive Coding

TengdaHan/DPC 10 Sep 2019

The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition.

Self-Supervised Learning by Cross-Modal Audio-Video Clustering

HumamAlwassel/XDC NeurIPS 2020

To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.

Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

BestJuly/VCP 2 Jan 2020

As a proxy task, Video Cloze Procedure (VCP) converts rich self-supervised representations into video clip operations (options), which enhances the flexibility and reduces the complexity of representation learning.