TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Recognition	Something-Something V2	MorphMLP-B (IN-1K)	Top-1 Accuracy	70.1	# 38
Action Recognition	Something-Something V2	MorphMLP-B (IN-1K)	Top-5 Accuracy	92.8	# 20
Action Recognition	Something-Something V2	MorphMLP-B (IN-1K)	Parameters	68.5	# 29
Action Recognition	Something-Something V2	MorphMLP-B (IN-1K)	GFLOPs	197x3	# 6
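
The GFLOPs entry `197x3` is not a single number. A minimal sketch of how it might be read, assuming the common video-benchmark convention that the value is the per-view cost multiplied by the number of inference views (this page does not state the convention, and the helper name `total_gflops` is hypothetical):

```python
def total_gflops(entry: str) -> float:
    """Return total inference cost for an entry like '197x3'.

    Assumption: the text before 'x' is GFLOPs per view and the text
    after it is the number of views; a bare number means one view.
    """
    per_view, _, views = entry.lower().partition("x")
    view_count = int(views) if views else 1
    return float(per_view) * view_count


print(total_gflops("197x3"))  # 591.0
```

Under that assumption, the total is 591 GFLOPs per video rather than 197.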

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

24 Nov 2021 · David Junhao Zhang, Kunchang Li, Yali Wang, Yunpeng Chen, Shashwat Chandra, Yu Qiao, Luoqi Liu, Mike Zheng Shou

Recently, MLP-Like networks have been revived for image recognition. However, whether a generic MLP-Like architecture can be built for the video domain has not been explored, owing to the complexity of spatial-temporal modeling and its large computational burden. To fill this gap, we present an efficient self-attention-free backbone, MorphMLP, which flexibly leverages the concise Fully-Connected (FC) layer for video representation learning. Specifically, a MorphMLP block consists of two key layers in sequence, MorphFC_s and MorphFC_t, for spatial and temporal modeling respectively. MorphFC_s captures core semantics in each frame through progressive token interaction along both the height and width dimensions. MorphFC_t, in turn, adaptively learns long-term dependencies across frames through temporal token aggregation at each spatial location. With this multi-dimension and multi-scale factorization, the MorphMLP block achieves a strong accuracy-computation balance. We evaluate MorphMLP on a number of popular video benchmarks. Compared with recent state-of-the-art models, MorphMLP significantly reduces computation while improving accuracy. For example, MorphMLP-S uses only 50% of the GFLOPs of VideoSwin-T yet achieves a 0.9% top-1 improvement on Kinetics400 under ImageNet1K pretraining. MorphMLP-B uses only 43% of the GFLOPs of MViT-B yet achieves a 2.4% top-1 improvement on SSV2, even though MorphMLP-B is pretrained on ImageNet1K while MViT-B is pretrained on Kinetics400. Moreover, our method adapted to the image domain outperforms previous SOTA MLP-Like architectures. Code is available at https://github.com/MTLab/MorphMLP.
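
The factorization described in the abstract (spatial token mixing along height and width, followed by temporal mixing at each spatial location) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the residual connections, the use of one full-length FC per axis, and the absence of any chunking, channel mixing, normalization, or nonlinearity are all simplifying assumptions made here for illustration. The official code is at the repository linked above.

```python
import numpy as np


def mix_along_axis(x, weight, axis):
    """Apply one fully-connected layer that mixes tokens along `axis`.

    `weight` has shape (n, n), where n is the size of `axis`.
    """
    moved = np.moveaxis(x, axis, -1)   # bring the target axis last
    mixed = moved @ weight.T           # FC over that axis only
    return np.moveaxis(mixed, -1, axis)


def toy_spatial_temporal_block(x, w_h, w_w, w_t):
    """Toy block for a video tensor x of shape (T, H, W, C).

    Assumed structure (hypothetical, for illustration only):
    1. spatial step: mix tokens along height and along width;
    2. temporal step: mix tokens along time at each (h, w) location.
    Each step is added back to its input as a residual.
    """
    spatial = mix_along_axis(x, w_h, axis=1) + mix_along_axis(x, w_w, axis=2)
    x = x + spatial
    temporal = mix_along_axis(x, w_t, axis=0)
    return x + temporal


rng = np.random.default_rng(0)
T, H, W, C = 4, 6, 6, 8
video = rng.standard_normal((T, H, W, C))
out = toy_spatial_temporal_block(
    video,
    w_h=rng.standard_normal((H, H)) * 0.1,
    w_w=rng.standard_normal((W, W)) * 0.1,
    w_t=rng.standard_normal((T, T)) * 0.1,
)
print(out.shape)  # (4, 6, 6, 8)
```

Because each FC acts on a single axis, the cost of a step scales with the square of that axis length rather than with the square of the total token count T·H·W, which is the general motivation for factorizing the mixing in this sketch.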

Code

MTLab/MorphMLP official

165

liuruiyang98/Jittor-MLP

160

Tasks

Action Recognition

Image Classification

Representation Learning

Semantic Segmentation

Video Classification

Datasets

Something-Something V2

Something-Something V1

Results from the Paper

Ranked #38 on Action Recognition on Something-Something V2 (using extra training data)

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
