TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Situation Recognition	imSitu	CoFormer	Top-1 Verb	44.66	# 2
Situation Recognition	imSitu	CoFormer	Top-1 Verb & Value	35.98	# 1
Situation Recognition	imSitu	CoFormer	Top-5 Verbs	73.31	# 2
Situation Recognition	imSitu	CoFormer	Top-5 Verbs & Value	57.76	# 2
Grounded Situation Recognition	SWiG	CoFormer	Top-1 Verb	44.66	# 2
Grounded Situation Recognition	SWiG	CoFormer	Top-1 Verb & Value	35.98	# 2
Grounded Situation Recognition	SWiG	CoFormer	Top-1 Verb & Grounded-Value	29.05	# 3
Grounded Situation Recognition	SWiG	CoFormer	Top-5 Verbs	73.31	# 2
Grounded Situation Recognition	SWiG	CoFormer	Top-5 Verbs & Value	57.76	# 2
Grounded Situation Recognition	SWiG	CoFormer	Top-5 Verbs & Grounded-Value	46.25	# 2

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/collaborative-transformers-for-grounded/situation-recognition-on-imsitu)](https://paperswithcode.com/sota/situation-recognition-on-imsitu?p=collaborative-transformers-for-grounded)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/collaborative-transformers-for-grounded/grounded-situation-recognition-on-swig)](https://paperswithcode.com/sota/grounded-situation-recognition-on-swig?p=collaborative-transformers-for-grounded)`

Collaborative Transformers for Grounded Situation Recognition

CVPR 2022 · Junhyeong Cho, Youngseok Yoon, Suha Kwak

Grounded situation recognition is the task of predicting the main activity, entities playing certain roles within the activity, and bounding-box groundings of the entities in the given image. To effectively deal with this challenging task, we introduce a novel approach where the two processes for activity classification and entity estimation are interactive and complementary. To implement this idea, we propose Collaborative Glance-Gaze TransFormer (CoFormer) that consists of two modules: Glance transformer for activity classification and Gaze transformer for entity estimation. Glance transformer predicts the main activity with the help of Gaze transformer that analyzes entities and their relations, while Gaze transformer estimates the grounded entities by focusing only on the entities relevant to the activity predicted by Glance transformer. Our CoFormer achieves the state of the art in all evaluation metrics on the SWiG dataset. Training code and model weights are available at https://github.com/jhcho99/CoFormer.
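The collaboration described in the abstract can be illustrated with a toy data-flow sketch. This is not the authors' implementation (see the official repository for that): there are no transformers here, and every name, score, role list, and box below is a hypothetical placeholder chosen only to show the order in which the two modules exchange information. Gaze first summarizes candidate entities, Glance uses that summary to choose the activity, and Gaze then keeps only the entities whose roles belong to the chosen activity.

```python
from dataclasses import dataclass

# Hypothetical verb -> role list; real role frames come from the dataset annotations.
VERB_ROLES = {
    "riding": ["agent", "vehicle", "place"],
    "eating": ["agent", "food", "place"],
}


@dataclass
class GroundedEntity:
    role: str
    noun: str
    box: tuple  # (x_min, y_min, x_max, y_max), toy pixel coordinates


def gaze_analyze(candidates):
    """Gaze, first pass (toy): summarize the candidate entities as a set of nouns."""
    return {c["noun"] for c in candidates}


def glance_predict(verb_scores, entity_context, verb_hints):
    """Glance (toy): choose the verb, adding a bonus for each supporting entity."""
    def score(verb):
        bonus = sum(1.0 for noun in verb_hints.get(verb, ()) if noun in entity_context)
        return verb_scores[verb] + bonus
    return max(verb_scores, key=score)


def gaze_estimate(verb, candidates):
    """Gaze, second pass (toy): keep only entities whose role belongs to the verb."""
    by_role = {c["role"]: c for c in candidates}
    return [
        GroundedEntity(role, by_role[role]["noun"], by_role[role]["box"])
        for role in VERB_ROLES[verb]
        if role in by_role
    ]


# Made-up inputs standing in for image-derived predictions.
candidates = [
    {"role": "agent", "noun": "person", "box": (10, 20, 110, 220)},
    {"role": "vehicle", "noun": "bicycle", "box": (30, 120, 150, 260)},
    {"role": "food", "noun": "apple", "box": (5, 5, 20, 20)},
    {"role": "place", "noun": "street", "box": (0, 0, 320, 280)},
]
verb_scores = {"riding": 0.4, "eating": 0.5}
verb_hints = {"riding": ["bicycle", "street"], "eating": ["apple"]}

context = gaze_analyze(candidates)
verb = glance_predict(verb_scores, context, verb_hints)
entities = gaze_estimate(verb, candidates)
```

In this toy run the entity context overturns the raw verb scores, and the second Gaze pass drops the candidate whose role is not part of the selected verb's frame. In CoFormer itself both steps are learned transformer modules rather than lookups.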

Code

jhcho99/coformer (official)

towhee-io/towhee (2,981 stars)

ruipingl/opensu

Tasks

Grounded Situation Recognition

Image Classification

Object Detection

Scene Understanding

Situation Recognition

Visual Grounding

Visual Reasoning

Datasets

ImageNet

FrameNet

Results from the Paper

Ranked #2 on Situation Recognition on imSitu

Methods

Adam • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
