Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers.
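The scale-invariance argument can be checked numerically. The sketch below is a hypothetical toy example (not the paper's actual setup): for any function with f(c·w) = f(w), the gradient at c·w equals the gradient at w divided by c, so rescaling the weights shrinks the effective step size but leaves the update direction unchanged.

```python
import numpy as np

# Toy scale-invariant function: f(w) = g(w / ||w||), so f(c*w) = f(w) for c > 0.
# (Weights feeding a normalization layer induce the same invariance.)
def f(w, x=np.array([1.0, 2.0, 3.0])):
    u = w / np.linalg.norm(w)
    return float(np.sin(u @ x))

def grad(w, eps=1e-6):
    # Central-difference numerical gradient.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

w = np.array([0.5, -1.0, 2.0])
c = 3.0
# Gradient shrinks by 1/c but keeps its direction: grad(c*w) ≈ grad(w) / c.
print(np.allclose(grad(c * w), grad(w) / c, atol=1e-4))
```

Since the gradient direction is preserved under rescaling, a GD step from c·w moves along the same direction as a step from w, which is the sense in which only effective step sizes change.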
In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding.
Ranked #1 on Self-Supervised Action Recognition on HMDB51
The perception models used in deep learning, on the other hand, are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models.
Ranked #1 on Audio Classification on AudioSet
Interpretability of deep neural networks is a recently emerging area of machine learning research targeting a better understanding of how models perform feature selection and derive their classification decisions.
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio.
Despite sound being a rich source of information, computing devices with microphones do not leverage audio to glean useful insights about their physical and social context.
We introduce a probabilistic approach to unify open set recognition with the prevention of catastrophic forgetting in deep continual learning, based on variational Bayesian inference.
Our method uses contrastive learning for cross-modal discrimination of video from audio and vice versa.
Ranked #2 on Self-Supervised Audio Classification on ESC-50
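Cross-modal contrastive discrimination of this kind is commonly implemented with an InfoNCE-style loss: each video clip's embedding should be most similar to the embedding of its own audio track among all audio embeddings in the batch. The sketch below is a minimal NumPy illustration of that idea, not the paper's actual implementation; the embedding shapes and temperature are assumptions.

```python
import numpy as np

def info_nce(video_emb, audio_emb, temperature=0.1):
    # Normalize both sets of embeddings, compute all pairwise similarities,
    # and treat the matching (video_i, audio_i) pair as the positive per row.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature                 # (N, N) similarity matrix
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Mean negative log-probability of the diagonal (the positive pairs).
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
v = rng.normal(size=(8, 16))
loss_aligned = info_nce(v, v)                      # perfectly aligned pairs
loss_random = info_nce(v, rng.normal(size=(8, 16)))  # unrelated pairs
print(loss_aligned < loss_random)
```

Symmetrizing the loss (adding the same term with rows and columns swapped) gives the "and vice versa" direction, audio discriminating video.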
To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
Ranked #1 on Self-Supervised Action Recognition on UCF101