Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-Bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement attained between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show that our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-Bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.
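The judging mechanic described in the abstract is simple to sketch. Below is a minimal, illustrative pairwise LLM-as-a-judge in Python with position-swap debiasing: each pair of answers is judged twice with the answer order reversed, and verdicts that flip with the order are counted as ties, one of the mitigations for position bias the paper discusses. The prompt wording, the `gpt-4` model string, and the OpenAI client usage are assumptions for illustration, not the paper's exact MT-Bench judging prompt or FastChat's llm_judge implementation.

```python
# Illustrative sketch only: prompt text and judge model are assumptions,
# not the paper's exact setup. Requires the `openai` package (v1+) and an
# OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are an impartial judge. Compare the two AI assistant answers to the "
    "user question and reply with exactly one word: 'A', 'B', or 'tie'.\n\n"
    "[Question]\n{question}\n\n[Answer A]\n{answer_a}\n\n[Answer B]\n{answer_b}"
)


def ask_judge(question: str, answer_a: str, answer_b: str) -> str:
    """One judging call; returns the judge's raw verdict string."""
    prompt = JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    resp = client.chat.completions.create(
        model="gpt-4",  # assumed judge model; any strong chat model could be used
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return (resp.choices[0].message.content or "").strip()


def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Return 'model_1', 'model_2', or 'tie' for a pair of answers.

    The pair is judged twice with the answer order swapped; verdicts that
    flip with the order are counted as a tie, which mitigates position bias.
    """
    forward = {"A": "model_1", "B": "model_2"}.get(
        ask_judge(question, answer_1, answer_2), "tie"
    )
    swapped = {"A": "model_2", "B": "model_1"}.get(
        ask_judge(question, answer_2, answer_1), "tie"
    )
    return forward if forward == swapped else "tie"
```

Aggregating such pairwise verdicts over a question set (e.g., the 80 MT-Bench questions) yields per-model win rates that can be compared against human votes to measure judge-human agreement.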

NeurIPS 2023

Datasets


Introduced in the Paper:

MT-Bench

Used in the Paper:

MMLU, TruthfulQA
Task: Long-Context Understanding. Each cell gives the metric value at the listed evaluation length, with the global rank in parentheses.

Ada-LEval (BestAnswer)
Model                 |    1k     |    2k      |    4k     |   6k     |   8k     |   12k    |   16k
Vicuna-7b-v1.5-16k    | 37.0 (#8) | 11.1 (#8)  |  5.8 (#8) | 3.2 (#8) | 1.8 (#9) | 1.9 (#6) | 1.0 (#5)
LongChat-7b-v1.5-32k  | 32.4 (#9) | 10.7 (#10) |  5.7 (#9) | 3.1 (#9) | 1.9 (#8) | 1.6 (#7) | 0.8 (#7)
Vicuna-13b-v1.5-16k   | 53.4 (#6) | 29.2 (#6)  | 13.1 (#6) | 4.3 (#7) | 2.2 (#7) | 1.4 (#8) | 0.9 (#6)

Ada-LEval (TSort)
Model                 |    2k    |    4k    |    8k    |   16k
Vicuna-13b-v1.5-16k   | 5.4 (#3) | 5.0 (#3) | 2.4 (#7) | 3.1 (#5)
Vicuna-7b-v1.5-16k    | 5.3 (#4) | 2.2 (#9) | 2.3 (#8) | 1.7 (#8)
LongChat-7b-v1.5-32k  | 5.3 (#4) | 5.0 (#3) | 3.1 (#6) | 2.5 (#7)
