Otter: A Multi-Modal Model with In-Context Instruction Tuning

5 May 2023  ·  Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, Ziwei Liu

Large language models (LLMs) have demonstrated significant universal capabilities as few/zero-shot learners across a wide range of tasks, owing to their pre-training on vast amounts of text data. A prominent example is GPT-3, which was later developed into InstructGPT and ChatGPT, models that effectively follow natural language instructions to accomplish real-world tasks. In this paper, we propose introducing instruction tuning into multi-modal models, motivated by the interleaved-format pretraining data used upstream by the Flamingo model. We adopt a similar approach to construct our MultI-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. We then introduce Otter, a multi-modal model based on OpenFlamingo (an open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following ability and in-context learning. We also optimize OpenFlamingo's implementation for researchers, democratizing the required training resources from 1$\times$ A100 GPU to 4$\times$ RTX-3090 GPUs, and integrate both OpenFlamingo and Otter into Huggingface Transformers so that more researchers can incorporate the models into their customized training and inference pipelines.
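To make the in-context instruction tuning described above concrete, below is a minimal, hypothetical sketch of what a single MIMIC-IT-style training sample and its serialized prompt could look like: a query instruction-response pair grouped with related in-context exemplars, mirroring Flamingo's interleaved image-text format. The field names and the "User: ... GPT: ..." template are illustrative assumptions, not the dataset's published schema.

```python
# Hypothetical sketch of one MIMIC-IT-style sample: a query instruction-response
# pair grouped with in-context exemplars over shared visual input. All field
# names below are illustrative assumptions, not the published MIMIC-IT schema.
sample = {
    "images": ["kitchen_000.jpg"],  # visual input(s) referenced by the turns
    "in_context": [
        {"instruction": "What appliance is on the counter?",
         "response": "A silver toaster."},
        {"instruction": "Is the stove turned on?",
         "response": "No, all burners are off."},
    ],
    "query": {
        "instruction": "Is it safe to leave the kitchen unattended?",
        "response": "Yes; nothing is cooking and the stove is off.",
    },
}

def to_prompt(sample: dict) -> str:
    """Serialize a sample into an interleaved prompt string.

    One "<image>" placeholder is emitted per visual input, followed by the
    in-context exemplar turns and finally the query turn. The exact template
    is an assumption chosen to mirror interleaved-format pretraining.
    """
    parts = ["<image>" * len(sample["images"])]
    for turn in sample["in_context"] + [sample["query"]]:
        parts.append(f"User: {turn['instruction']} GPT: {turn['response']}")
    return " ".join(parts)

print(to_prompt(sample))
```

In a setup like this, training loss would typically be applied only to the response tokens of the query turn, while at inference time the final response is left for the model to generate given the in-context exemplars.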

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Visual Question Answering | BenchLMM | Otter-7B | GPT-3.5 score | 39.13 | #8 |
| Visual Question Answering (VQA) | InfiMM-Eval | Otter | Overall score | 22.69 | #10 |
| Visual Question Answering (VQA) | InfiMM-Eval | Otter | Deductive | 22.49 | #11 |
| Visual Question Answering (VQA) | InfiMM-Eval | Otter | Abductive | 33.64 | #10 |
| Visual Question Answering (VQA) | InfiMM-Eval | Otter | Analogical | 13.33 | #10 |
| Visual Question Answering (VQA) | InfiMM-Eval | Otter | Params | 7B | #1 |
