We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embeddings in a joint embedding space.
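To make the joint-embedding idea concrete, here is a minimal sketch (not the authors' implementation) of one common way to realise it: project both modalities into a shared space, L2-normalise, and rank by cosine similarity in either retrieval direction. All names and dimensions (`JointEmbedder`, `VIDEO_DIM`, `TEXT_DIM`, `JOINT_DIM`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VIDEO_DIM, TEXT_DIM, JOINT_DIM = 1024, 512, 256  # assumed feature sizes

class JointEmbedder(nn.Module):
    """Projects video and adverb-action text features into one joint space."""
    def __init__(self):
        super().__init__()
        self.video_proj = nn.Linear(VIDEO_DIM, JOINT_DIM)  # video branch
        self.text_proj = nn.Linear(TEXT_DIM, JOINT_DIM)    # adverb-action text branch

    def forward(self, video_feats, text_feats):
        # L2-normalise so dot products are cosine similarities
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v @ t.T  # rows: videos, columns: adverb-action phrases

model = JointEmbedder()
videos = torch.randn(8, VIDEO_DIM)    # batch of video embeddings
phrases = torch.randn(20, TEXT_DIM)   # candidate adverb-action text embeddings
sims = model(videos, phrases)
best_phrase = sims.argmax(dim=1)      # video-to-adverb retrieval
best_video = sims.argmax(dim=0)       # adverb-to-video retrieval
```

In practice such projections are trained with a contrastive objective that pulls matching video/text pairs together and pushes mismatched pairs apart.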
Training deep learning models for video classification on audio-visual data commonly requires large amounts of labeled data, which are costly to collect.
Semantic image synthesis adds control to otherwise unconditional image generation by allowing the user to guide what is generated.
We show that our proposed framework, which ingests temporal features, yields state-of-the-art performance on the UCF, VGGSound, and ActivityNet benchmarks for (generalised) zero-shot learning.
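In generalised zero-shot evaluation, performance is commonly summarised by the harmonic mean (HM) of the mean class accuracies on seen classes ($S$) and unseen classes ($U$), which rewards models that perform well on both rather than trading one for the other:

\[ \mathrm{HM} = \frac{2\,S\,U}{S + U} \]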
Recent advances in explainable AI (XAI) provide explanations for models trained on still images.
From a neuroscientific perspective, natural language is embodied, grounded in most, if not all, sensory and sensorimotor modalities, and acquired by means of crossmodal integration.