Commonly Uncommon: Semantic Sparsity in Situation Recognition

Semantic sparsity is a common challenge in structured visual classification problems; when the output space is complex, the vast majority of the possible predictions are rarely, if ever, seen in the training set. This paper studies semantic sparsity in situation recognition, the task of producing structured summaries of what is happening in images, including activities, objects and the roles objects play within the activity. For this problem, we find empirically that most object-role combinations are rare, and current state-of-the-art models significantly underperform in this sparse data regime. We avoid many such errors by (1) introducing a novel tensor composition function that learns to share examples across role-noun combinations and (2) semantically augmenting our training data with automatically gathered examples of rarely observed outputs using web data. When integrated within a complete CRF-based structured prediction model, the tensor-based approach outperforms existing state of the art by a relative improvement of 2.11% and 4.40% on top-5 verb and noun-role accuracy, respectively. Adding 5 million images with our semantic augmentation techniques gives further relative improvements of 6.23% and 9.57% on top-5 verb and noun-role accuracy.
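The tensor composition idea can be illustrated with a minimal sketch. All dimensions, parameter names, and the elementwise composition below are illustrative assumptions, not the paper's exact formulation: the point is only that a role-noun score is built from shared role and noun embeddings, so rarely observed role-noun pairs borrow statistical strength from embeddings trained on common pairs.

```python
import numpy as np

# Hypothetical sizes (not from the paper): image feature dim D, embedding dim E.
D, E = 512, 128
NUM_ROLES, NUM_NOUNS = 50, 90

rng = np.random.default_rng(0)

# In the real model these are learned; random here purely for illustration.
W_img  = rng.standard_normal((E, D)) * 0.01        # projects image features
E_role = rng.standard_normal((NUM_ROLES, E)) * 0.01  # shared role embeddings
E_noun = rng.standard_normal((NUM_NOUNS, E)) * 0.01  # shared noun embeddings

def role_noun_potential(img_feat, role_id, noun_id):
    """Score one (role, noun) assignment for an image.

    The role and noun are composed (here: elementwise product) rather than
    looked up as a single role-noun unit, so a rare combination like
    (role=agent, noun=otter) reuses parameters seen in common combinations.
    """
    h = W_img @ img_feat                              # image projection, shape (E,)
    composed = E_role[role_id] * E_noun[noun_id]      # composed pair embedding
    return float(h @ composed)                        # scalar potential
```

In a CRF over the full situation, potentials like this one would be combined with a verb potential and normalized over the structured output space; the sketch stops at the pairwise score.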

PDF · Abstract (CVPR 2017)

Results from the Paper


Task                            Dataset  Model      Metric Name          Metric Value  Global Rank
Situation Recognition           imSitu   CRF + Aug  Top-1 Verb           34.12         # 11
                                                    Top-1 Verb & Value   26.45         # 11
                                                    Top-5 Verbs          62.59         # 10
                                                    Top-5 Verbs & Value  46.88         # 9
Grounded Situation Recognition  SWiG     CRF + Aug  Top-1 Verb           34.12         # 11
                                                    Top-1 Verb & Value   26.45         # 11
                                                    Top-5 Verbs          62.59         # 10
                                                    Top-5 Verbs & Value  46.88         # 9
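For reference, the Top-1/Top-5 verb metrics in the table can be computed as in the sketch below. The `top_k_verb_accuracy` helper and the toy score matrix are illustrative assumptions; the "Verb & Value" variants additionally require the predicted role-noun values to match the annotation, which this sketch does not cover.

```python
import numpy as np

def top_k_verb_accuracy(scores, gold_verbs, k=5):
    """Fraction of images whose gold verb is among the k highest-scoring verbs.

    scores:     (num_images, num_verbs) array of model scores
    gold_verbs: (num_images,) array of gold verb indices
    """
    topk = np.argsort(-scores, axis=1)[:, :k]          # indices of k best verbs
    hits = [g in row for g, row in zip(gold_verbs, topk)]
    return sum(hits) / len(hits)

# Toy example: 2 images, 3 verbs.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])
gold = np.array([1, 2])
print(top_k_verb_accuracy(scores, gold, k=1))  # 0.5: only image 0 is correct
print(top_k_verb_accuracy(scores, gold, k=3))  # 1.0: gold verb always in top 3
```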

Methods


No methods listed for this paper.