Self-Supervised Action Recognition

21 papers with code • 6 benchmarks • 4 datasets

Greatest papers with code

Spatiotemporal Contrastive Video Representation Learning

tensorflow/models • CVPR 2021

Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away.

Ranked #1 on Self-Supervised Action Recognition on Kinetics-400 (using extra training data)

Contrastive Learning • Data Augmentation • +4
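
The contrastive objective summarized in the entry above (clips from the same video pulled together, clips from different videos pushed apart) can be illustrated with a minimal InfoNCE-style sketch. This is not the authors' implementation: the function name `info_nce_loss`, the one-directional form of the loss, and the temperature value of 0.1 are assumptions chosen for illustration, and NumPy stands in for a deep learning framework.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Illustrative InfoNCE-style loss (hypothetical helper, not from the paper).

    z_a[i] and z_b[i] are embeddings of two augmented clips from video i.
    Every other row of z_b acts as a negative for z_a[i].
    """
    # L2-normalize so that dot products become cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Pairwise similarities, shape (N, N); the temperature is an assumed value.
    logits = (z_a @ z_b.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positive pairs sit on the diagonal (same video index in both views).
    return float(-np.mean(np.diag(log_prob)))
```

Minimizing this quantity raises the similarity of each same-video pair relative to the cross-video pairs in the batch, which is the pull-together and push-away behavior the excerpt describes.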

Self-Supervised MultiModal Versatile Networks

deepmind/deepmind-research • NeurIPS 2020

In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding.

Action Recognition In Videos • Audio Classification • +2

Contrastive Multiview Coding

HobbitLong/PyContrast • ECCV 2020

We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics.

Contrastive Learning • Object Classification • +2

Video Representation Learning by Dense Predictive Coding

TengdaHan/DPC • 10 Sep 2019

The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition.

Representation Learning • Self-Supervised Action Recognition • +1

Learning the Predictability of the Future

cvlab-columbia/hyperfuture • CVPR 2021

We introduce a framework for learning from unlabeled video what is predictable in the future.

Hierarchical structure • Representation Learning • +2

Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework

BestJuly/IIC • 6 Aug 2020

With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations.

Action Recognition In Videos • Contrastive Learning • +5

Self-Supervised Learning by Cross-Modal Audio-Video Clustering

HumamAlwassel/XDC • NeurIPS 2020

To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.

Audio Classification • Deep Clustering • +4