Self-Supervised Action Recognition
34 papers with code • 6 benchmarks • 5 datasets
Latest papers
Joint Adversarial and Collaborative Learning for Self-Supervised Action Recognition
Considering the instance-level discriminative ability, contrastive learning methods, including MoCo and SimCLR, have been adapted from the original image representation learning task to solve the self-supervised skeleton-based action recognition task.
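MoCo- and SimCLR-style methods both optimize an InfoNCE objective: a query embedding is pulled toward its positive (an augmented view of the same skeleton sequence) and pushed from all negatives. A minimal plain-Python sketch of that loss for a single query (embeddings as plain float lists; the temperature value is an illustrative default, not taken from any specific paper):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query: a (1 + num_negatives)-way softmax
    cross-entropy where the positive pair must win."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))
```

The loss shrinks as the query aligns with its positive and decorrelates from the negatives; MoCo draws the negatives from a momentum-encoded queue, SimCLR from the other samples in the batch.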
Part Aware Contrastive Learning for Self-Supervised Action Recognition
This paper proposes an attention-based contrastive learning framework for skeleton representation learning, called SkeAttnCLR, which integrates local similarity and global features for skeleton-based action representations.
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2).
Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning
A good data representation should capture relations between instances, i.e. semantic similarity and dissimilarity, which standard contrastive learning harms by treating all negatives as noise.
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks.
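The distillation objective behind this kind of masked feature modeling is simple: the student reconstructs the teacher's features at the masked token positions, and the loss is computed only there. A toy sketch with one scalar feature per token (real models regress high-dimensional feature vectors; this simplification is mine, not from the paper):

```python
def masked_distill_loss(student_feats, teacher_feats, mask):
    """Mean squared error between student predictions and frozen teacher
    features, restricted to masked token positions (mask[i] == True)."""
    per_token = [
        (s - t) ** 2
        for s, t, masked in zip(student_feats, teacher_feats, mask)
        if masked
    ]
    return sum(per_token) / max(len(per_token), 1)
```

Swapping the teacher (image model vs. video model) changes only where `teacher_feats` comes from, which is what lets the paper compare spatial and temporal teachers under one objective.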
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning
First, masked data reconstruction is performed to learn modality-specific representations from audio and visual streams.
EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens
Masked Video Autoencoder (MVA) approaches have demonstrated their potential by significantly outperforming previous video representation learning methods.
Masked Motion Encoding for Self-Supervised Video Representation Learning
The latest attempts seek to learn a representation model by predicting the appearance contents in the masked regions.
SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos
A key reason is that sampling pairs of similar video clips, a step required by many self-supervised contrastive learning methods, is currently done conservatively to avoid false positives.
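The conservative sampling mentioned above typically means taking two non-overlapping clips from the same video as a positive pair, so the pair is similar by construction. A minimal sketch of that sampling step (function name and frame-index representation are illustrative):

```python
import random

def sample_clip_pair(num_frames, clip_len, rng=None):
    """Sample two non-overlapping clips (as frame-index lists) from one
    video, to be used as a positive pair for contrastive learning."""
    rng = rng or random.Random()
    if num_frames < 2 * clip_len:
        raise ValueError("video too short for two disjoint clips")
    start_a = rng.randrange(0, num_frames - 2 * clip_len + 1)
    # Second clip starts strictly after the first one ends.
    start_b = rng.randrange(start_a + clip_len, num_frames - clip_len + 1)
    return (list(range(start_a, start_a + clip_len)),
            list(range(start_b, start_b + clip_len)))
```

Sampling both clips from the same video avoids false positives but also limits the diversity of positives, which is the trade-off SLIC's iterative clustering tries to relax.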
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets.
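VideoMAE's data efficiency rests on tube masking with a very high masking ratio: one spatial mask is sampled and repeated across all frames, so the model cannot cheat by copying a patch from a neighboring frame. A small sketch of that masking scheme (grid sizes and the 90% ratio are the paper's reported defaults; the function itself is my illustration):

```python
import random

def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=0):
    """Tube masking: sample one spatial mask and repeat it over time.
    Returns a per-frame list of booleans, True = token is masked."""
    rng = random.Random(seed)
    num_spatial = grid_h * grid_w
    num_masked = int(num_spatial * mask_ratio)
    masked = set(rng.sample(range(num_spatial), num_masked))
    # Every frame shares the same spatial mask (a "tube" through time).
    frame_mask = [s in masked for s in range(num_spatial)]
    return [list(frame_mask) for _ in range(num_frames)]
```

Because ~90% of tokens are dropped before the encoder, pre-training cost falls sharply, which is what makes training on relatively small datasets practical.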