TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Recognition	AVA v2.2	ORViT MViT-B, 16x4 (K400 pretraining)	mAP	26.6	# 34
Action Recognition	Diving-48	ORViT TimeSformer	Accuracy	88.0	# 6
Action Recognition	EPIC-KITCHENS-100	ORViT Mformer-L (ORViT blocks)	Action@1	45.7	# 13
Action Recognition	EPIC-KITCHENS-100	ORViT Mformer-L (ORViT blocks)	Verb@1	68.4	# 16
Action Recognition	EPIC-KITCHENS-100	ORViT Mformer-L (ORViT blocks)	Noun@1	58.7	# 12
Action Recognition	Something-Something V2	ORViT Mformer (ORViT blocks)	Top-1 Accuracy	67.9	# 55
Action Recognition	Something-Something V2	ORViT Mformer (ORViT blocks)	Top-5 Accuracy	90.5	# 54
Action Recognition	Something-Something V2	ORViT Mformer (ORViT blocks)	Parameters	N/A	# 37
Action Recognition	Something-Something V2	ORViT Mformer (ORViT blocks)	GFLOPs	N/A	# 6
Action Recognition	Something-Something V2	ORViT Mformer-L (ORViT blocks)	Top-1 Accuracy	69.5	# 44
Action Recognition	Something-Something V2	ORViT Mformer-L (ORViT blocks)	Top-5 Accuracy	91.5	# 35
Action Recognition	Something-Something V2	ORViT Mformer-L (ORViT blocks)	Parameters	N/A	# 37
Action Recognition	Something-Something V2	ORViT Mformer-L (ORViT blocks)	GFLOPs	N/A	# 6

Object-Region Video Transformers

CVPR 2022 · Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric representations starting from early layers and propagate them into the transformer layers, thus affecting the spatio-temporal representations throughout the network. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an "Object-Region Attention" module applies self-attention over the patches and object regions. In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate "Object-Dynamics Module", which captures trajectory interactions, and show how to integrate the two streams. We evaluate our model on four tasks and five datasets: compositional and few-shot action recognition on SomethingElse, spatio-temporal action detection on AVA, and standard action recognition on Something-Something V2, Diving-48 and EPIC-KITCHENS-100. We show strong performance improvement across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. For code and pretrained models, visit the project page at https://roeiherz.github.io/ORViT/.
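
The appearance stream described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, uses the tokens themselves as queries, keys and values (no learned projections), and treats each object region as one pre-pooled feature vector. The names `joint_attention`, `patch_tokens` and `object_tokens` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def joint_attention(patch_tokens, object_tokens):
    """Attend from every patch token to all patch and object tokens.

    Assumption (not from the paper): no learned Q/K/V projections and a
    single head. Only the updated patch tokens are returned, which is
    the same as running self-attention over the concatenated sequence
    and keeping the patch positions.
    """
    tokens = patch_tokens + object_tokens  # concatenated key/value set
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)           # scaled dot-product attention
    updated = []
    for query in patch_tokens:
        weights = softmax([dot(query, key) * scale for key in tokens])
        updated.append(
            [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)]
        )
    return updated


# Toy input: 6 patch tokens and 2 object-region tokens, each of dimension 4.
random.seed(0)
patches = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]
objects = [[random.gauss(0, 1) for _ in range(4)] for _ in range(2)]
enriched = joint_attention(patches, objects)
print(len(enriched), len(enriched[0]))
```

Because the object tokens sit in the same key/value set as the patches, every output patch token is a weighted mix of patch and object features, which is the sense in which the patches are "enriched" with object information in this sketch.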

Code

eladb3/orvit

Tasks

Action Detection

Action Recognition

Few-Shot action recognition

Few Shot Action Recognition

Object

Video Understanding

Datasets

MS COCO

Something-Something V2

EPIC-KITCHENS-100

AVA

Results from the Paper

Ranked #6 on Action Recognition on Diving-48

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Action Recognition	AVA v2.2	ORViT MViT-B, 16x4 (K400 pretraining)	mAP	26.6	# 34
Action Recognition	Diving-48	ORViT TimeSformer	Accuracy	88.0	# 6
Action Recognition	Something-Something V2	ORViT Mformer (ORViT blocks)	Top-1 Accuracy	67.9	# 55
Action Recognition	Something-Something V2	ORViT Mformer (ORViT blocks)	Top-5 Accuracy	90.5	# 54
Action Recognition	Something-Something V2	ORViT Mformer (ORViT blocks)	Parameters	N/A	# 37
Action Recognition	Something-Something V2	ORViT Mformer (ORViT blocks)	GFLOPs	N/A	# 6

Results from Other Papers

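Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Action Recognition	EPIC-KITCHENS-100	ORViT Mformer-L (ORViT blocks)	Action@1	45.7	# 13
Action Recognition	EPIC-KITCHENS-100	ORViT Mformer-L (ORViT blocks)	Verb@1	68.4	# 16
Action Recognition	EPIC-KITCHENS-100	ORViT Mformer-L (ORViT blocks)	Noun@1	58.7	# 12
Action Recognition	Something-Something V2	ORViT Mformer-L (ORViT blocks)	Top-1 Accuracy	69.5	# 44
Action Recognition	Something-Something V2	ORViT Mformer-L (ORViT blocks)	Top-5 Accuracy	91.5	# 35
Action Recognition	Something-Something V2	ORViT Mformer-L (ORViT blocks)	Parameters	N/A	# 37
Action Recognition	Something-Something V2	ORViT Mformer-L (ORViT blocks)	GFLOPs	N/A	# 6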
