Learning Correlation Structures for Vision Transformers

5 Apr 2024 · Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho

We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This allows the model to exploit rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
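
The abstract describes the mechanism only at a high level. The sketch below illustrates one way such a layer could look for 2D feature maps, under several simplifying assumptions: a single head, images only (no temporal dimension), and a small convolution over the key-query correlation map whose responses are turned into dynamic weights over each key's local value neighborhood. The class name StructSA2d, the num_patterns and kernel_size parameters, and the softmax placements are illustrative choices, not the authors' implementation.

```python
# Minimal, hypothetical sketch of structural self-attention for images,
# written from the abstract's description only; not the paper's code.
import torch
import torch.nn as nn


class StructSA2d(nn.Module):
    def __init__(self, dim, grid_size, kernel_size=3, num_patterns=8):
        super().__init__()
        self.h, self.w = grid_size
        self.k2 = kernel_size ** 2
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Convolution over the query-key correlation map: each output channel
        # responds to a local correlation structure rather than a single
        # pointwise similarity.
        self.pattern_conv = nn.Conv2d(1, num_patterns, kernel_size,
                                      padding=kernel_size // 2)
        # Map pattern responses to dynamic weights over each key position's
        # local (kernel_size x kernel_size) value neighborhood.
        self.to_kernel = nn.Linear(num_patterns, self.k2)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, N, C) with N = h * w
        B, N, C = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Pointwise query-key correlations, viewed as an h x w map per query.
        corr = torch.einsum('bqc,bkc->bqk', q, k) * self.scale   # (B, Nq, Nk)
        corr_map = corr.reshape(B * N, 1, self.h, self.w)

        # Recognize spatial structures in the correlation map via convolution.
        patterns = self.pattern_conv(corr_map)                   # (B*Nq, P, h, w)
        patterns = patterns.flatten(2).transpose(1, 2)           # (B*Nq, Nk, P)

        # Attention over key positions, plus a dynamic kernel per key that
        # weights its local value context according to the detected structure
        # (softmax placements here are an assumption).
        attn = corr.softmax(dim=-1).reshape(B * N, N, 1)         # (B*Nq, Nk, 1)
        kernel = self.to_kernel(patterns).softmax(dim=-1)        # (B*Nq, Nk, k*k)
        weights = (attn * kernel).reshape(B, N, N, self.k2)      # (B, Nq, Nk, k*k)

        # Local value contexts around every key position.
        v_map = v.transpose(1, 2).reshape(B, C, self.h, self.w)
        v_local = self.unfold(v_map).reshape(B, C, self.k2, N)   # (B, C, k*k, Nk)
        v_local = v_local.permute(0, 3, 2, 1)                    # (B, Nk, k*k, C)

        # Aggregate local value contexts with the structure-aware weights.
        out = torch.einsum('bqnk,bnkc->bqc', weights, v_local)   # (B, Nq, C)
        return self.proj(out)


# Example: a 14x14 token grid with 64-dim features.
x = torch.randn(2, 14 * 14, 64)
y = StructSA2d(64, (14, 14))(x)   # -> (2, 196, 64)
```

The actual StructViT operates on space-time volumes for video, so the convolution and local value aggregation would extend to 3D; this 2D single-head version is only meant to convey the flow from key-query correlations to structure-aware value aggregation.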

Results from the Paper


Task                    Dataset                 Model            Metric          Value (%)  Global Rank
Action Recognition      Diving-48               StructViT-B-4-1  Accuracy        88.3       #4
Action Classification   Kinetics-400            StructViT-B-4-1  Acc@1           83.4       #60
Action Recognition      Something-Something V1  StructViT-B-4-1  Top-1 Accuracy  61.3       #7
Action Recognition      Something-Something V2  StructViT-B-4-1  Top-1 Accuracy  71.5       #27