Multi-View Action Recognition Using Contrastive Learning

In this work, we present a method for RGB-based action recognition from multi-view videos. We propose a supervised contrastive learning framework that learns a feature embedding robust to changes in viewpoint by effectively leveraging multi-view data: we use an improved supervised contrastive loss and augment the set of positives with samples from synchronized viewpoints. We further propose a new approach that uses classifier probabilities to guide the selection of hard negatives in the contrastive loss, yielding a more discriminative representation; negative samples from classes the classifier confuses (i.e., classes with high posterior probability) are weighted more heavily. We also show that, when trained on synthetic multi-view data, our method achieves better domain generalization than standard supervised training. Extensive experiments on real (NTU-60, NTU-120, NUMA) and synthetic (RoCoG) data demonstrate the effectiveness of our approach.
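The two key ingredients of the abstract — positives augmented with synchronized viewpoints, and negatives weighted by classifier posteriors — can be illustrated with a minimal sketch. The code below is not the authors' released implementation; the function name, the exact weighting scheme (weighting each negative by the anchor's posterior probability for that negative's class), and all array shapes are illustrative assumptions.

```python
import numpy as np

def view_contrastive_loss(feats, labels, clip_ids, posteriors, temp=0.1):
    """Illustrative supervised contrastive loss (not the paper's exact code).

    feats      : (N, D) L2-normalized embeddings
    labels     : (N,)   action class per sample
    clip_ids   : (N,)   clip identifier; synchronized views share an id
    posteriors : (N, C) classifier softmax outputs, used to up-weight
                 negatives from classes the classifier finds confusing
    """
    n = feats.shape[0]
    sim = feats @ feats.T / temp          # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)        # exclude self-pairs (exp -> 0)
    exp_sim = np.exp(sim)

    # Positives: same action label OR another synchronized view of the clip.
    pos_mask = (labels[:, None] == labels[None, :]) | \
               (clip_ids[:, None] == clip_ids[None, :])
    np.fill_diagonal(pos_mask, False)
    neg_mask = ~pos_mask & ~np.eye(n, dtype=bool)

    # Weight negative j for anchor i by i's posterior for j's class, so
    # negatives from confusing classes contribute more to the denominator.
    neg_w = posteriors[:, labels] * neg_mask          # (N, N)
    neg_sum = (neg_w * exp_sim).sum(axis=1)           # per-anchor negatives

    losses = []
    for i in range(n):
        pos = np.where(pos_mask[i])[0]
        if len(pos) == 0:
            continue
        log_prob = sim[i, pos] - np.log(exp_sim[i, pos] + neg_sum[i])
        losses.append(-log_prob.mean())
    return float(np.mean(losses))

# Toy usage: 6 samples, 3 classes; samples 0/1 and 4/5 are synchronized views.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 8))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
labels = np.array([0, 0, 1, 1, 2, 2])
clip_ids = np.array([10, 10, 11, 12, 13, 13])
posteriors = rng.dirichlet(np.ones(3), size=6)
loss = view_contrastive_loss(feats, labels, clip_ids, posteriors)
```

Setting all posterior weights to a uniform constant recovers an ordinary supervised contrastive denominator, which is one way to see the weighting as a drop-in modification.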


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | NTU RGB+D | ViewCon (RGB + Pose) | Accuracy (CS) | 93.7 | # 12 |
| Action Recognition | NTU RGB+D | ViewCon (RGB + Pose) | Accuracy (CV) | 98.9 | # 3 |
| Action Recognition | NTU RGB+D 120 | ViewCon (RGB) | Accuracy (Cross-Subject) | 85.6 | # 12 |
| Action Recognition | NTU RGB+D 120 | ViewCon (RGB) | Accuracy (Cross-Setup) | 87.5 | # 11 |
