Self-Supervised Action Recognition

17 papers with code • 5 benchmarks • 4 datasets

Greatest papers with code

Self-Supervised MultiModal Versatile Networks

deepmind/deepmind-research NeurIPS 2020

In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding.

Action Recognition In Videos • Audio Classification • +2

Contrastive Multiview Coding

HobbitLong/PyContrast ECCV 2020

We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics.

Object Classification • Self-Supervised Action Recognition • +1

Video Representation Learning by Dense Predictive Coding

TengdaHan/DPC 10 Sep 2019

The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition.

Representation Learning • Self-Supervised Action Recognition • +1

Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework

BestJuly/Inter-intra-video-contrastive-learning 6 Aug 2020

With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations.

Action Recognition In Videos • Representation Learning • +4

Self-Supervised Learning by Cross-Modal Audio-Video Clustering

HumamAlwassel/XDC NeurIPS 2020

To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.

Audio Classification • Clustering • +5

Temporally Coherent Embeddings for Self-Supervised Video Representation Learning

csiro-robotics/TCE 21 Mar 2020

The proposed method exploits inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space, rather than indirectly learning it through ranking or predictive proxy tasks.

Metric Learning • Self-Supervised Action Recognition • +2