Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

NeurIPS 2023 · Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, Hartwig Adam

We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model. 2) Sparsification with MoE on a single modality-agnostic encoder substantially improves performance, outperforming dense models that use modality-specific encoders or additional fusion layers, and greatly mitigates the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks, including video classification, image classification, image-text retrieval, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L variant focusing on video tasks that achieves a new state of the art in zero-shot video classification: 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 68.3% on Kinetics-700, improving on the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of their total training computational cost.
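The alternating update described in insight 1 can be illustrated with a toy sketch. This is not the paper's implementation: the two quadratic objectives, the task names `image_text` and `video_text`, the round-robin schedule, and the helper names are all assumptions chosen only to show a shared parameter being updated by one task's gradient at a time.

```python
from itertools import cycle


def make_quadratic_task(target):
    """Return (loss, grad) for the toy objective 0.5 * (w - target)^2."""
    def loss(w):
        return 0.5 * (w - target) ** 2

    def grad(w):
        return w - target

    return loss, grad


def alternating_gradient_descent(tasks, w0, lr, num_steps):
    """Update one shared parameter, using a single task's gradient per step.

    Tasks are visited in round-robin order (an assumed schedule).
    """
    w = w0
    schedule = cycle(tasks.items())
    history = []
    for _ in range(num_steps):
        name, (loss_fn, grad_fn) = next(schedule)
        w = w - lr * grad_fn(w)              # step on the current task only
        history.append((name, loss_fn(w)))   # loss of that task after the step
    return w, history


# Two hypothetical tasks that pull the shared parameter toward different targets.
tasks = {
    "image_text": make_quadratic_task(1.0),
    "video_text": make_quadratic_task(3.0),
}
w_final, history = alternating_gradient_descent(tasks, w0=0.0, lr=0.1, num_steps=200)
```

In this toy setting the shared parameter settles between the two targets rather than at either one, which is the sense in which alternating steps trade off competing objectives without summing their losses in a single step.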

Code

No code implementations yet.

Tasks

Classification

Image Classification

Text Retrieval

Video Classification

Video-Text Retrieval

Zero-Shot Action Recognition

Zero-Shot Environment Sound Classification

Zero-Shot Learning

Zero-Shot Transfer Image Classification

Datasets

ImageNet

MS COCO

UCF101

Kinetics

HMDB51

Flickr30k

Kinetics 400

AudioSet

ESC-50

CC12M

Results from the Paper

Ranked #1 on Zero-Shot Action Recognition on Kinetics (using extra training data)

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Zero-Shot Environment Sound Classification	ESC-50	IMP-MoE-L	Accuracy	65.1	#6
Zero-Shot Action Recognition	HMDB51	IMP-MoE-L	Top-1 Accuracy	59.1	#5
Zero-Shot Transfer Image Classification	ImageNet	IMP-MoE-L	Accuracy (Private)	83.9	#8
Zero-Shot Action Recognition	Kinetics	IMP-MoE-L	Top-1 Accuracy	76.8	#1
Zero-Shot Action Recognition	UCF101	IMP-MoE-L	Top-1 Accuracy	91.5	#2
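
The table reports zero-shot top-1 accuracy, but this page does not state the evaluation protocol. A minimal sketch of one common protocol follows, assuming each input is embedded and assigned to the class whose class-name text embedding has the highest cosine similarity. The embeddings, class names, and function names below are made up for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(input_embedding, class_embeddings):
    """Return the class name whose text embedding is most similar to the input."""
    return max(
        class_embeddings,
        key=lambda name: cosine_similarity(input_embedding, class_embeddings[name]),
    )


def top1_accuracy(predictions, labels):
    """Percentage of inputs whose single predicted class matches the label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Hypothetical 3-dimensional embeddings; a real model would produce these.
class_embeddings = {
    "archery": [1.0, 0.0, 0.0],
    "bowling": [0.0, 1.0, 0.0],
    "surfing": [0.0, 0.0, 1.0],
}
video_embeddings = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.6, 0.5, 0.1],  # labelled "bowling" but closer to "archery"
    [0.0, 0.2, 0.8],
]
labels = ["archery", "bowling", "bowling", "surfing"]

predictions = [zero_shot_predict(v, class_embeddings) for v in video_embeddings]
accuracy = top1_accuracy(predictions, labels)  # 3 of 4 correct -> 75.0
```

No labelled examples of the target classes are used to fit anything here; only the class names are embedded, which is what makes the setting zero-shot under this assumed protocol.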

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
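
The methods listed above are dense Transformer components, while the abstract's IMP-MoE variant sparsifies the encoder with Mixture-of-Experts. The page does not describe the router, so the following is only a sketch under assumptions: top-1 routing with a softmax gate, toy experts standing in for feed-forward blocks, and hypothetical names throughout (`moe_layer`, `router_weights`, `experts`).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def moe_layer(token, router_weights, experts):
    """Send one token to its highest-scoring expert (assumed top-1 routing).

    Only the selected expert runs, so compute per token stays constant
    as the number of experts grows.
    """
    gate_probs = softmax(matvec(router_weights, token))
    expert_index = max(range(len(experts)), key=lambda i: gate_probs[i])
    expert_output = experts[expert_index](token)
    # Scale the output by the gate probability of the chosen expert.
    scaled = [gate_probs[expert_index] * y for y in expert_output]
    return scaled, expert_index


# Two toy experts standing in for feed-forward blocks.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [-1.0 * v for v in x],
]
# One router row per expert; here each expert prefers one input dimension.
router_weights = [[1.0, 0.0], [0.0, 1.0]]

out_a, idx_a = moe_layer([3.0, 0.0], router_weights, experts)  # routed to expert 0
out_b, idx_b = moe_layer([0.0, 3.0], router_weights, experts)  # routed to expert 1
```

Because different tokens can reach different experts through the same shared layer, a router of this kind gives a single modality-agnostic encoder some capacity to specialize per input, which is one plausible reading of the abstract's claim about mitigating conflicts between modalities.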
