GZSL Video Classification
3 papers with code • 6 benchmarks • 1 dataset
Audio-visual zero-shot learning aims to recognize categories unseen during training from paired audio and visual sequences; in the generalised setting (GZSL), test samples are drawn from both seen and unseen classes.
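In the common setup, a clip-level audio-visual embedding is scored against textual label embeddings of every candidate class, so unseen classes can be recognised without any training videos. Below is a minimal sketch of that scoring step; the cosine-similarity rule, tensor shapes, and function names are illustrative assumptions rather than any particular paper's method.

```python
import torch
import torch.nn.functional as F

def classify(av_embedding: torch.Tensor, class_text_embeddings: torch.Tensor) -> torch.Tensor:
    """Score every candidate class for each clip and return predicted class indices.

    av_embedding:          (batch, dim) fused audio-visual clip representation
    class_text_embeddings: (num_classes, dim) label embeddings, e.g. word2vec vectors
    """
    av = F.normalize(av_embedding, dim=-1)
    txt = F.normalize(class_text_embeddings, dim=-1)
    scores = av @ txt.t()            # cosine similarity to every class prototype
    return scores.argmax(dim=-1)     # in GZSL the candidate set mixes seen and unseen classes

# Toy example: 4 clips with 512-d embeddings, 10 seen + 5 unseen classes.
preds = classify(torch.randn(4, 512), torch.randn(15, 512))
```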
Most implemented papers
Temporal and cross-modal attention for audio-visual zero-shot learning
We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the UCF-GZSL, VGGSound-GZSL, and ActivityNet-GZSL benchmarks for (generalised) zero-shot learning.
Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
Focusing on the relatively underexplored task of audio-visual zero-shot learning, we propose to learn multi-modal representations from audio-visual data using cross-modal attention and exploit textual label embeddings for transferring knowledge from seen classes to unseen classes.
Boosting Audio-visual Zero-shot Learning with Large Language Models
Recent methods mainly focus on learning multi-modal features aligned with class names to enhance the generalization ability to unseen categories.
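Several of the papers above fuse the two modalities with cross-modal attention before matching the result to class-name embeddings. The sketch below shows one plausible way to express that idea with standard PyTorch layers; the layer sizes, mean pooling, and use of nn.MultiheadAttention are assumptions for illustration, not a reproduction of any specific method.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Each modality attends over the other; outputs one fused clip embedding."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio, video: (batch, time, dim) sequence features from pretrained extractors
        audio_ctx, _ = self.audio_to_video(query=audio, key=video, value=video)
        video_ctx, _ = self.video_to_audio(query=video, key=audio, value=audio)
        # Mean-pool over time to get one joint embedding per clip, which can then
        # be matched against textual label embeddings as sketched above.
        return (audio_ctx.mean(dim=1) + video_ctx.mean(dim=1)) / 2

fused = CrossModalAttention()(torch.randn(2, 10, 512), torch.randn(2, 10, 512))  # (2, 512)
```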