Grounded Situation Recognition
8 papers with code • 1 benchmark • 0 datasets
Grounded Situation Recognition aims to produce a structured image summary that describes the primary activity (verb), its relevant entities (nouns), and their bounding-box groundings.
Semantic sparsity is a common challenge in structured visual classification problems; when the output space is complex, the vast majority of the possible predictions are rarely, if ever, seen in the training set.
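As a rough illustration of semantic sparsity, the sketch below enumerates every possible role filling for a toy vocabulary and measures how much of that output space a small training set covers. The noun list, role list, and training tuples are hypothetical values chosen for the example, not data from any benchmark.

```python
from itertools import product

# Hypothetical toy vocabulary and role set (not taken from a real dataset).
nouns = ["man", "woman", "sheep", "shears", "field", "wool"]
roles = ["agent", "tool", "item"]

# Every way of assigning one noun to each role: len(nouns) ** len(roles) frames.
all_frames = set(product(nouns, repeat=len(roles)))

# Hypothetical training annotations, one tuple per image, ordered like `roles`.
train = [
    ("man", "shears", "wool"),
    ("woman", "shears", "wool"),
    ("man", "shears", "wool"),  # repeated frame adds no new coverage
]

seen = set(train)
coverage = len(seen) / len(all_frames)

print(f"possible frames: {len(all_frames)}")
print(f"distinct frames seen in training: {len(seen)}")
print(f"coverage of the output space: {coverage:.4f}")
```

Even with three roles and six nouns, the training set touches under one percent of the possible frames. The gap grows quickly as the vocabulary or the number of roles increases.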
To implement this idea, we propose the Collaborative Glance-Gaze TransFormer (CoFormer), which consists of two modules: a Glance transformer for activity classification and a Gaze transformer for entity estimation.
This paper introduces situation recognition, the problem of producing a concise summary of the situation an image depicts, including: (1) the main activity (e.g., clipping), (2) the participating actors, objects, substances, and locations (e.g., man, shears, sheep, wool, and field), and, most importantly, (3) the roles these participants play in the activity (e.g., the man is clipping, the shears are his tool, the wool is being clipped from the sheep, and the clipping is in a field).
We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images describing: the primary activity, entities engaged in the activity with their roles (e.g., agent, tool), and bounding-box groundings of entities.
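One way to picture the structured output described above is as a small record holding a verb and a mapping from roles to grounded nouns. The sketch below is only an assumed representation: the class names, the role labels, the box convention (x_min, y_min, x_max, y_max), and the coordinates are all hypothetical, and the example reuses the clipping scene mentioned earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class GroundedEntity:
    noun: str
    bbox: Optional[BBox] = None  # None when the entity has no grounding


@dataclass
class GroundedSituation:
    verb: str
    roles: Dict[str, GroundedEntity] = field(default_factory=dict)

    def grounded_roles(self) -> List[str]:
        """Return the roles whose entity has a bounding box."""
        return [r for r, e in self.roles.items() if e.bbox is not None]


# Hypothetical annotation for the clipping example; coordinates are made up.
situation = GroundedSituation(
    verb="clipping",
    roles={
        "agent": GroundedEntity("man", (40.0, 20.0, 210.0, 300.0)),
        "tool": GroundedEntity("shears", (150.0, 160.0, 200.0, 210.0)),
        "item": GroundedEntity("wool", (120.0, 180.0, 330.0, 320.0)),
        "source": GroundedEntity("sheep", (100.0, 150.0, 380.0, 340.0)),
        "place": GroundedEntity("field"),  # scene-level, left ungrounded here
    },
)

print(situation.verb, situation.grounded_roles())
```

Keeping the box optional lets the same record describe roles that are named but not localized, such as a scene-level place.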
However, existing query-based reasoning methods have not considered the handling of inter-dependent queries, which is a unique requirement of semantic role prediction in situation recognition (SR).
Since each verb is associated with a specific set of semantic roles, all existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage.
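The two-stage flow described above can be sketched with plain dictionaries standing in for model outputs. This is a minimal sketch rather than any paper's method: the frame table `VERB_ROLES`, the function names, and all scores are hypothetical, and bounding-box prediction is omitted for brevity.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical frame table: each verb fixes the set of roles to fill.
VERB_ROLES: Dict[str, List[str]] = {
    "clipping": ["agent", "tool", "item", "source", "place"],
    "riding": ["agent", "vehicle", "place"],
}


def predict_verb(verb_scores: Dict[str, float]) -> str:
    """Stage 1: pick the highest-scoring verb."""
    return max(verb_scores, key=lambda v: verb_scores[v])


def predict_roles(
    verb: str, role_scores: Dict[str, Dict[str, float]]
) -> Dict[str, Optional[str]]:
    """Stage 2: fill only the roles that belong to the predicted verb."""
    filled: Dict[str, Optional[str]] = {}
    for role in VERB_ROLES[verb]:
        candidates = role_scores.get(role, {})
        # Leave the role empty when no candidate noun was scored for it.
        filled[role] = (
            max(candidates, key=lambda n: candidates[n]) if candidates else None
        )
    return filled


def two_stage(
    verb_scores: Dict[str, float], role_scores: Dict[str, Dict[str, float]]
) -> Tuple[str, Dict[str, Optional[str]]]:
    verb = predict_verb(verb_scores)
    return verb, predict_roles(verb, role_scores)


# Made-up scores standing in for classifier outputs.
verb_scores = {"clipping": 0.8, "riding": 0.2}
role_scores = {
    "agent": {"man": 0.9, "woman": 0.1},
    "tool": {"shears": 0.7, "knife": 0.3},
    "vehicle": {"horse": 0.6},  # ignored: not a role of "clipping"
}

verb, roles = two_stage(verb_scores, role_scores)
print(verb, roles)
```

Because the second stage looks up roles from the predicted verb, a wrong verb in the first stage changes which roles are even considered.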