TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Question Answering (VQA)	InfiMM-Eval	CogVLM-Chat	Overall score	37.16	# 4
Visual Question Answering (VQA)	InfiMM-Eval	CogVLM-Chat	Deductive	36.75	# 4
Visual Question Answering (VQA)	InfiMM-Eval	CogVLM-Chat	Abductive	47.88	# 4
Visual Question Answering (VQA)	InfiMM-Eval	CogVLM-Chat	Analogical	28.75	# 3
Visual Question Answering (VQA)	InfiMM-Eval	CogVLM-Chat	Params	17B	# 1
Visual Question Answering	MM-Vet	GLM4 Vision	GPT-4 score	63.9	# 5
Visual Question Answering	MM-Vet	CogVLM(Vicuna-7B)	GPT-4 score	52.8	# 15
Visual Question Answering	MM-Vet	CogVLM(Vicuna-7B)	Params	17B	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cogvlm-visual-expert-for-pretrained-language/visual-question-answering-vqa-on-core-mm)](https://paperswithcode.com/sota/visual-question-answering-vqa-on-core-mm?p=cogvlm-visual-expert-for-pretrained-language)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cogvlm-visual-expert-for-pretrained-language/visual-question-answering-on-mm-vet)](https://paperswithcode.com/sota/visual-question-answering-on-mm-vet?p=cogvlm-visual-expert-for-pretrained-language)`

CogVLM: Visual Expert for Pretrained Language Models

6 Nov 2023 · Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang

We introduce CogVLM, a powerful open-source visual language foundation model. Unlike the popular shallow alignment method, which maps image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision and language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Code and checkpoints are available at https://github.com/THUDM/CogVLM.
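The abstract describes a trainable visual expert that sits in the attention and FFN layers alongside a frozen language model. One way to picture this is per-token routing: image tokens pass through one set of projection weights, text tokens through another. The sketch below is a toy illustration under that assumption, not the authors' implementation; `routed_projection`, `text_weight`, and `image_weight` are hypothetical names, and the 2x2 matrices are made-up values chosen so the routing is easy to see.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weight: Matrix, x: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def routed_projection(
    tokens: List[Vector],
    is_image: List[bool],
    text_weight: Matrix,
    image_weight: Matrix,
) -> List[Vector]:
    """Project each token with the weights matching its modality.

    text_weight stands in for a frozen language-model projection and
    image_weight for a trainable visual-expert projection. Both are
    hypothetical placeholders for illustration only.
    """
    if len(tokens) != len(is_image):
        raise ValueError("tokens and is_image must have the same length")
    return [
        matvec(image_weight if img else text_weight, tok)
        for tok, img in zip(tokens, is_image)
    ]


# Toy sequence: two image tokens followed by one text token.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_image = [True, True, False]
text_weight = [[1.0, 0.0], [0.0, 1.0]]   # identity: text tokens pass through
image_weight = [[2.0, 0.0], [0.0, 2.0]]  # doubles image tokens

projected = routed_projection(tokens, is_image, text_weight, image_weight)
print(projected)
```

Because the text path uses its own untouched weights, text-only inputs would be processed exactly as before in this toy setup, which is one plausible reading of why the abstract reports no loss on NLP tasks.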

Code

thudm/cogvlm official

5,035

Tasks

Language Modelling

Visual Question Answering

Visual Question Answering (VQA)

Datasets

MS COCO

Visual Question Answering

MMLU

RefCOCO

OK-VQA

TextVQA

NoCaps

ScienceQA

Visual7W

MMBench

MM-Vet

TextCaps

SEED-Bench

LLaVA-Bench

MathVista

InfiMM-Eval

Results from the Paper

Ranked #4 on Visual Question Answering (VQA) on InfiMM-Eval

