17 papers with code • 5 benchmarks • 4 datasets
In particular, we explore how best to combine the modalities so that fine-grained representations of the visual and audio modalities are maintained while text is also integrated into a common embedding.
Ranked #1 on Self-Supervised Action Recognition on HMDB51
We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics.
Ranked #29 on Self-Supervised Action Recognition on UCF101
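The contrastive objective referenced above pulls embeddings of different views of the same scene together and pushes views of different scenes apart. A minimal NumPy sketch of such an InfoNCE-style loss between two views — the function name and the temperature value are illustrative, not taken from the paper:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE contrastive loss between two batches of view embeddings.

    z1, z2: (N, D) arrays; row i of z1 and row i of z2 embed the same
    scene under two different views (the positive pair), while all other
    rows in the batch act as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                 # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # positives on the diagonal
```

Perfectly aligned views drive the loss toward zero; mismatched pairings raise it, which is the signal that shapes the representation.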
The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition.
Ranked #15 on Self-Supervised Action Recognition on UCF101
The objective of this paper is visual-only self-supervised video representation learning.
With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations.
Ranked #2 on Self-supervised Video Retrieval on HMDB51
We conduct extensive experiments with C3D to validate the effectiveness of our proposed approach.
Ranked #27 on Self-Supervised Action Recognition on HMDB51
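A distinctive ingredient of the Inter-Intra Contrastive idea is the "intra-negative": a corrupted version of the same clip that keeps appearance statistics but breaks temporal structure, forcing the network to encode motion. A hedged sketch of one such corruption (frame-order shuffling) — the paper also studies other variants, and this helper is a generic illustration, not its exact procedure:

```python
import numpy as np

def make_intra_negative(clip, rng=None):
    """Return an intra-negative sample by shuffling a clip's frame order.

    clip: array of shape (T, C, H, W). The shuffled clip shares every
    frame with the original but loses its temporal coherence, so treating
    it as a negative pushes the model beyond static appearance cues.
    """
    if rng is None:
        rng = np.random.default_rng()
    order = rng.permutation(len(clip))
    # re-draw until the order actually changes (for clips with > 1 frame)
    while len(clip) > 1 and np.array_equal(order, np.arange(len(clip))):
        order = rng.permutation(len(clip))
    return clip[order]
```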
To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
Ranked #1 on Self-Supervised Action Recognition on UCF101
Our method uses contrastive learning for cross-modal discrimination of video from audio and vice versa.
Ranked #2 on Self-Supervised Audio Classification on ESC-50
The proposed method exploits the inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space, rather than learning it indirectly through ranking or predictive proxy tasks.
Ranked #17 on Self-Supervised Action Recognition on HMDB51
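Explicitly enforcing temporal coherency usually means pulling embeddings of temporally adjacent frames together while keeping temporally distant frames apart by a margin. The sketch below is one generic formulation of such an objective, with a hypothetical pairing scheme and margin, not the paper's exact loss:

```python
import numpy as np

def temporal_coherence_loss(emb, margin=1.0):
    """Temporal coherence objective over a sequence of frame embeddings.

    emb: (T, D) array of embeddings in temporal order. Adjacent frames
    (t, t+1) are pulled together; each early frame is paired with a frame
    half a sequence away as a negative and pushed at least `margin` apart.
    """
    # attraction term: distances between consecutive frames
    pos = np.linalg.norm(emb[1:] - emb[:-1], axis=1)
    # repulsion term: hinge on distances between temporally distant frames
    half = len(emb) // 2
    neg = np.linalg.norm(emb[:half] - emb[half:2 * half], axis=1)
    return pos.mean() + np.maximum(0.0, margin - neg).mean()
```

On a slowly drifting trajectory the hinge term vanishes and only the small adjacent-frame distances remain, which is exactly the geometry the objective rewards.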