TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Question Answering	ActivityNet-QA	Mirasol3B	Accuracy	51.13	# 4
Audio Classification	EPIC-SOUNDS	Mirasol3B	Accuracy	78.2	# 1
Action Classification	Kinetics-Sounds	Mirasol3B	Top 1 Accuracy	90.1	# 1
Video Question Answering	MSRVTT-QA	Mirasol3B	Accuracy	50.42	# 1
Video Question Answering	NExT-QA	Mirasol3B	Accuracy	72	# 8
Audio Classification	VGGSound	Mirasol3B	Top 1 Accuracy	69.8	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mirasol3b-a-multimodal-autoregressive-model/audio-classification-on-epic-sounds)](https://paperswithcode.com/sota/audio-classification-on-epic-sounds?p=mirasol3b-a-multimodal-autoregressive-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mirasol3b-a-multimodal-autoregressive-model/action-classification-on-kinetics-sounds)](https://paperswithcode.com/sota/action-classification-on-kinetics-sounds?p=mirasol3b-a-multimodal-autoregressive-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mirasol3b-a-multimodal-autoregressive-model/video-question-answering-on-msrvtt-qa)](https://paperswithcode.com/sota/video-question-answering-on-msrvtt-qa?p=mirasol3b-a-multimodal-autoregressive-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mirasol3b-a-multimodal-autoregressive-model/audio-classification-on-vggsound)](https://paperswithcode.com/sota/audio-classification-on-vggsound?p=mirasol3b-a-multimodal-autoregressive-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mirasol3b-a-multimodal-autoregressive-model/video-question-answering-on-activitynet-qa)](https://paperswithcode.com/sota/video-question-answering-on-activitynet-qa?p=mirasol3b-a-multimodal-autoregressive-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mirasol3b-a-multimodal-autoregressive-model/video-question-answering-on-next-qa)](https://paperswithcode.com/sota/video-question-answering-on-next-qa?p=mirasol3b-a-multimodal-autoregressive-model)`

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

9 Nov 2023 · AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title or a description. Furthermore, video and audio inputs are of much larger volume and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. Here we decouple the multimodal modeling, dividing it into separate, focused autoregressive models that process the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video) and an autoregressive component for the context modalities, which are not necessarily aligned in time but are still sequential. To address the long sequences of the video-audio inputs, we propose to further partition the video and audio sequences into consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features, producing compact but expressive representations per snippet. Our approach achieves the state of the art on well-established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
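The data flow described in the abstract (partition the time-aligned streams into consecutive snippets, fuse each snippet into a compact representation, then process those representations causally) can be illustrated with a toy sketch. This is not the paper's implementation: the function names `partition`, `combine`, and `autoregressive_states` are hypothetical, mean pooling stands in for the learned Combiner, and a running average stands in for the autoregressive model. No code has been released for the paper, so every detail below is an assumption made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def partition(seq: Sequence[Vector], num_snippets: int) -> List[List[Vector]]:
    """Split a time-ordered feature sequence into consecutive, equal-size snippets."""
    if num_snippets <= 0 or len(seq) == 0 or len(seq) % num_snippets != 0:
        raise ValueError("sequence length must be a positive multiple of num_snippets")
    size = len(seq) // num_snippets
    return [list(seq[i * size:(i + 1) * size]) for i in range(num_snippets)]


def mean(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def combine(video_snippet: Sequence[Vector], audio_snippet: Sequence[Vector]) -> Vector:
    """Assumed stand-in for the Combiner: pool each modality within one
    snippet and concatenate the results into a single compact vector.
    The Combiner in the paper is learned; this pooling is only a placeholder."""
    return mean(video_snippet) + mean(audio_snippet)


def autoregressive_states(fused: Sequence[Vector]) -> List[Vector]:
    """Assumed causal stand-in: state t depends only on fused snippets 0..t."""
    return [mean(fused[: t + 1]) for t in range(len(fused))]


# Toy inputs: 8 video frames with 2-dim features, 4 audio frames with 1-dim features.
video = [[float(t)] * 2 for t in range(8)]
audio = [[float(10 * t)] for t in range(4)]

# Both streams are cut into the same number of snippets so they stay time-aligned.
video_snippets = partition(video, 2)
audio_snippets = partition(audio, 2)

# One compact fused vector per snippet, then a causal pass over the snippets.
fused = [combine(v, a) for v, a in zip(video_snippets, audio_snippets)]
states = autoregressive_states(fused)
```

The point of the sketch is the sequence-length control: the causal pass sees one vector per snippet rather than every frame, so its cost depends on the number of snippets, not on the raw video length.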

Code

No code implementations yet.

Tasks

Action Classification

Audio Classification

Video Question Answering

Datasets

Kinetics

VGG-Sound

ActivityNet-QA

NExT-QA

MSRVTT-QA

EPIC-SOUNDS

Results from the Paper

Ranked #1 on Audio Classification on VGGSound

Methods

No methods listed for this paper.
