Self-Supervised Action Recognition
34 papers with code • 6 benchmarks • 5 datasets
Most implemented papers
Video Representation Learning by Dense Predictive Coding
The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition.
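The core idea of dense predictive coding is to aggregate embeddings of past frames, predict the embedding of a future segment, and train contrastively so the true future scores above distractors. A minimal toy sketch of that scoring step (all names, shapes, and the mean-pooling aggregator here are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def dpc_scores(past, future_candidates, W):
    """Toy dense-predictive-coding step: aggregate past frame
    embeddings, linearly predict the future embedding, and score
    each candidate by dot product. A contrastive loss would push
    the true future's score above the distractors'."""
    context = past.mean(axis=0)        # aggregate past embeddings
    pred = W @ context                 # predicted future embedding
    return future_candidates @ pred    # similarity to each candidate

d = 8
past = rng.normal(size=(4, d))             # 4 past frame embeddings
candidates = rng.normal(size=(5, d))       # 1 true future + 4 distractors
W = np.eye(d)                              # hypothetical prediction head
scores = dpc_scores(past, candidates, W)   # one score per candidate
```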
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning
As a proxy task, it converts rich self-supervised representations into video clip operations (options), which enhances the flexibility and reduces the complexity of representation learning.
Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video
We propose a self-supervised visual learning method that predicts the variable playback speed of a video.
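Playback-speed prediction turns raw video into free labels: subsample frames at different rates and ask a network to classify which rate was used, which forces it to attend to motion. A minimal sketch of generating one such training example (the function name and the integer-array stand-in for frames are assumptions for illustration):

```python
import numpy as np

def make_speed_example(frames, speed, clip_len):
    """Take every `speed`-th frame and keep the first `clip_len`;
    the speed itself is the classification label, so solving the
    task requires noticing how fast content changes."""
    clip = frames[::speed][:clip_len]
    return clip, speed

frames = np.arange(64).reshape(64, 1)   # stand-in for 64 video frames
clip, label = make_speed_example(frames, speed=2, clip_len=8)
# clip holds frames 0, 2, 4, ..., 14; label is the speed class 2
```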
Temporally Coherent Embeddings for Self-Supervised Video Representation Learning
The proposed method exploits the inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space, rather than learning it indirectly through ranking or predictive proxy tasks.
SpeedNet: Learning the Speediness in Videos
We demonstrate how those learned features can boost the performance of self-supervised action recognition, and can be used for video retrieval.
Audio-Visual Instance Discrimination with Cross-Modal Agreement
Our method uses contrastive learning for cross-modal discrimination of video from audio and vice versa.
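Cross-modal contrastive objectives of this kind typically treat matching (video, audio) pairs from the same clip as positives and all other pairings in the batch as negatives. A minimal symmetric InfoNCE sketch in numpy (the function name and temperature value are assumptions, not the paper's exact formulation):

```python
import numpy as np

def cross_modal_infonce(v, a, tau=0.1):
    """Symmetric InfoNCE over a batch: row i of `v` (video) should
    match row i of `a` (audio); every other pairing is a negative."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    logits = v @ a.T / tau  # (B, B) cosine similarities / temperature
    # video -> audio cross-entropy on the diagonal
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    v2a = -np.mean(np.diag(logp))
    # audio -> video direction
    logp_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    a2v = -np.mean(np.diag(logp_t))
    return 0.5 * (v2a + a2v)

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
loss_aligned = cross_modal_infonce(v, v.copy())        # perfect pairing
loss_shuffled = cross_modal_infonce(v, v[[1, 2, 3, 0]])  # broken pairing
```

Aligned pairs give a much lower loss than mismatched ones, which is what drives the two modalities toward a shared embedding space.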
Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning
The generative perception model acts as a feature decoder that, by introducing a motion-attention mechanism, focuses on comprehending high-temporal-resolution, short-term representations.
Self-Supervised MultiModal Versatile Networks
In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding.
Self-supervised Co-training for Video Representation Learning
The objective of this paper is visual-only self-supervised video representation learning.