Multi-label zero-shot learning
12 papers with code • 3 benchmarks • 2 datasets
In other cases the semantic embedding space is established by an independent natural language processing task, and then the image transformation into that space is learned in a second stage.
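As a rough illustration of the two-stage idea in the excerpt above, the following is a minimal sketch rather than any listed paper's method. It assumes the label embeddings are already given by a separate language model (the toy 2-D vectors in `LABEL_VECS` are invented for illustration), fits a linear map from image features to that space using seen labels only, and then ranks unseen labels by cosine similarity. All names (`train_linear_map`, `rank_labels`, the feature values) are hypothetical.

```python
import math

# Stage 1 (assumed already done): label embeddings from an independent NLP task.
# These toy 2-D vectors are made up for illustration.
LABEL_VECS = {
    "cat": [1.0, 0.0],    # seen during training
    "car": [0.0, 1.0],    # seen during training
    "dog": [0.9, 0.1],    # unseen
    "truck": [0.1, 0.9],  # unseen
}

def project(W, x):
    """Map an image feature vector x into the label embedding space."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def train_linear_map(features, targets, dim_out, lr=0.1, epochs=200):
    """Stage 2: fit W by stochastic gradient descent on squared error."""
    dim_in = len(features[0])
    W = [[0.0] * dim_in for _ in range(dim_out)]
    for _ in range(epochs):
        for x, t in zip(features, targets):
            pred = project(W, x)
            for r in range(dim_out):
                err = t[r] - pred[r]
                for c in range(dim_in):
                    W[r][c] += lr * err * x[c]
    return W

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_labels(W, x, label_vecs, candidates):
    """Score each candidate label against the projected image, best first."""
    z = project(W, x)
    scored = [(name, cosine(z, label_vecs[name])) for name in candidates]
    return sorted(scored, key=lambda s: s[1], reverse=True)

# Train on seen labels only (toy 3-D image features).
W = train_linear_map(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [LABEL_VECS["cat"], LABEL_VECS["car"]],
    dim_out=2,
)
# Rank labels that never appeared in training.
ranking = rank_labels(W, [0.8, 0.2, 0.0], LABEL_VECS, ["dog", "truck"])
print(ranking)
```

For multi-label prediction, one simple option under these assumptions is to keep every label whose score exceeds a threshold instead of taking only the top-ranked one.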
In this paper, we propose a novel deep learning architecture for multi-label zero-shot learning (ML-ZSL), which is able to predict multiple unseen class labels for each input instance.
Audio-based music classification and tagging is typically based on categorical supervised learning with a fixed set of labels.
Therefore, instead of generating attentions for unseen labels, which have unknown behaviors and could focus on irrelevant regions because they lack training samples, we let the unseen labels select among a set of shared attentions, which are trained through our novel loss to be label-agnostic and to focus only on relevant/foreground regions.
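The selection step described in this excerpt can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the shared attention queries are fixed by hand rather than trained with the paper's loss, region features and label vectors are assumed to live in the same space, and the names `attend` and `score_label` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def attend(regions, query):
    """Pool region features with softmax weights from one shared attention query."""
    weights = softmax([dot(r, query) for r in regions])
    dim = len(regions[0])
    return [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]

def score_label(regions, shared_queries, label_vec):
    """Let a label (seen or unseen) pick the shared attention that scores it highest."""
    attended = [attend(regions, q) for q in shared_queries]
    scores = [dot(a, label_vec) for a in attended]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy image: two salient regions and one weak background region.
regions = [[5.0, 0.0], [0.0, 5.0], [0.1, 0.1]]
# Two label-agnostic attention queries (hand-picked here, learned in practice).
shared_queries = [[1.0, 0.0], [0.0, 1.0]]

# Two labels with no attention of their own select from the shared set.
choice_a = score_label(regions, shared_queries, [1.0, 0.0])
choice_b = score_label(regions, shared_queries, [0.0, 1.0])
print(choice_a, choice_b)
```

Because no attention module is created per label, an unseen label needs only its embedding vector at test time in this sketch.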
Interaction Compass: Multi-Label Zero-Shot Learning of Human-Object Interactions via Spatial Relations
We study the problem of multi-label zero-shot recognition in which labels take the form of human-object interactions (combinations of actions on objects), where each image may contain multiple interactions and some interactions have no training images.
We argue that using a single embedding vector to represent an image, as commonly practiced, is not sufficient to rank both relevant seen and unseen labels accurately.
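A toy case can make this argument concrete. The sketch below is only an illustration with invented 2-D label vectors, not the paper's model: it compares a single image embedding (here assumed to be the mean of the relevant label vectors) with a set of two embeddings scored by their maximum similarity. The names `score_single` and `score_multi` are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Invented unit-length label vectors: "cat" and "car" are relevant to the image,
# "pet" is irrelevant but lies between the two in embedding space.
labels = {
    "cat": [1.0, 0.0],
    "car": [0.0, 1.0],
    "pet": [0.7071, 0.7071],
}

def score_single(image_vec, label_vec):
    """One embedding per image: relevance is a single dot product."""
    return dot(image_vec, label_vec)

def score_multi(image_vecs, label_vec):
    """Several embeddings per image: relevance is the best match among them."""
    return max(dot(v, label_vec) for v in image_vecs)

def rank(score_fn, image_repr):
    """Return label names ordered from highest to lowest score."""
    return sorted(labels, key=lambda n: score_fn(image_repr, labels[n]), reverse=True)

# Single vector: the mean of the two relevant label vectors (an assumption).
single_rank = rank(score_single, [0.5, 0.5])
# Two vectors: one per semantic direction present in the image.
multi_rank = rank(score_multi, [[1.0, 0.0], [0.0, 1.0]])
print(single_rank, multi_rank)
```

In this toy setting the averaged single vector ranks the irrelevant label first, while the two-vector representation ranks both relevant labels above it. Other single vectors may behave differently, so this shows one failure mode rather than a general proof.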
CLIP (Contrastive Language-Image Pre-training) is a recent multi-modal model that jointly learns representations of images and text.
We note that the best existing multi-label ZSL method takes a shared approach towards attending to region features with a common set of attention maps for all the classes.