Situation Recognition: Visual Semantic Role Labeling for Image Understanding

CVPR 2016 · Mark Yatskar, Luke Zettlemoyer, Ali Farhadi

This paper introduces situation recognition, the problem of producing a concise summary of the situation an image depicts, including: (1) the main activity (e.g., clipping), (2) the participating actors, objects, substances, and locations (e.g., man, shears, sheep, wool, and field), and, most importantly, (3) the roles these participants play in the activity (e.g., the man is clipping, the shears are his tool, the wool is being clipped from the sheep, and the clipping is in a field). We use FrameNet, a verb and role lexicon developed by linguists, to define a large space of possible situations and collect a large-scale dataset containing over 500 activities, 1,700 roles, 11,000 objects, 125,000 images, and 200,000 unique situations. We also introduce structured prediction baselines and show that, in activity-centric images, situation-driven prediction of objects and activities outperforms independent object and activity recognition.
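The prediction target is structured: a situation pairs a verb with a frame that maps each of the verb's FrameNet roles to a noun value. Below is a minimal Python sketch of that structure together with a CRF-style scoring and inference loop; the potential functions, argument names, and exhaustive candidate enumeration are illustrative assumptions, not the paper's actual model or code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Situation:
    verb: str    # main activity, e.g. "clipping"
    frame: dict  # role -> noun value, e.g. {"agent": "man", "tool": "shears"}

def score(image_feats, sit, verb_potential, role_potential):
    """CRF-style score: one potential for the verb plus one per (role, value)
    pair, all conditioned on the image (hypothetical potential functions)."""
    total = verb_potential(image_feats, sit.verb)
    for role, value in sit.frame.items():
        total += role_potential(image_feats, sit.verb, role, value)
    return total

def predict(image_feats, candidates, verb_potential, role_potential):
    """Inference sketch: exhaustively score each candidate situation and keep
    the best one; a real model would not enumerate the full space."""
    return max(candidates,
               key=lambda s: score(image_feats, s, verb_potential, role_potential))

# Toy usage with dummy potentials:
cands = [Situation("clipping", {"agent": "man", "tool": "shears", "place": "field"})]
print(predict(None, cands, lambda img, v: 1.0, lambda img, v, r, n: 0.5).verb)
```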


Datasets


imSitu (introduced in this paper)

Results from the Paper


Task                            Dataset  Model  Metric               Value  Global Rank
Situation Recognition           imSitu   CRF    Top-1 Verb           32.34  #12
Situation Recognition           imSitu   CRF    Top-1 Verb & Value   24.64  #12
Situation Recognition           imSitu   CRF    Top-5 Verbs          58.88  #12
Situation Recognition           imSitu   CRF    Top-5 Verbs & Value  42.76  #12
Grounded Situation Recognition  SWiG     CRF    Top-1 Verb           32.34  #12
Grounded Situation Recognition  SWiG     CRF    Top-1 Verb & Value   24.64  #12
Grounded Situation Recognition  SWiG     CRF    Top-5 Verbs          58.88  #12
Grounded Situation Recognition  SWiG     CRF    Top-5 Verbs & Value  42.76  #12
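For reference, here is a hedged sketch of how the accuracy columns above can be computed, reusing the Situation class from the earlier snippet. The exact definition of "Value" (per-role credit versus requiring every role to be correct) and the multi-annotator gold format are assumptions, not details taken from this page.

```python
def top_k_verb(ranked_preds, gold_verb, k=1):
    """Top-k verb accuracy: a hit if the gold verb appears among the k
    highest-scoring predicted situations."""
    return gold_verb in [p.verb for p in ranked_preds[:k]]

def top_k_verb_and_value(ranked_preds, gold_verb, gold_frames, k=1):
    """Per-role value credit within the top k (an assumed reading of
    "Verb & Value"): a role counts when the prediction's verb matches the
    gold verb and its noun matches at least one annotator's value.
    gold_frames: one role -> noun dict per annotator (assumed format)."""
    best = 0.0
    for p in ranked_preds[:k]:
        if p.verb != gold_verb or not p.frame:
            continue
        hits = sum(any(f.get(role) == value for f in gold_frames)
                   for role, value in p.frame.items())
        best = max(best, hits / len(p.frame))
    return best
```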

Methods


CRF (conditional random field), the structured prediction baseline reported in the results above.