TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Common Sense Reasoning	ARC (Easy)	Mistral 7B (0-shot)	Accuracy	80.5	# 11
Common Sense Reasoning	ARC (Easy)	Mixtral 8x7B (0-shot)	Accuracy	83.1	# 9
Code Generation	HumanEval	Mistral 7B (0-shot)	Pass@1	26.2	# 86
Code Generation	HumanEval	Mixtral 8x7B (0-shot)	Pass@1	40.2	# 60
Math Word Problem Solving	MATH	Mixtral 8x7B (maj@4)	Accuracy	28.4	# 67
Math Word Problem Solving	MATH	Mistral 7B (maj@4)	Accuracy	12.7	# 86
Math Word Problem Solving	MATH	Mistral 7B (maj@4)	Parameters (Billions)	7	# 58
Code Generation	MBPP	Mixtral 8x7B (3-shot)	Accuracy	60.7	# 35
Multi-task Language Understanding	MMLU	Mistral 7B (5-shot)	Average (%)	62.5	# 47
Multi-task Language Understanding	MMLU	Mixtral 8x7B (5-shot)	Average (%)	70.6	# 28
Question Answering	PIQA	Mixtral 8x7B (0-shot)	Accuracy	83.6	# 9
Question Answering	PIQA	Mistral 7B (0-shot)	Accuracy	82.2	# 16
Common Sense Reasoning	WinoGrande	Mixtral 8x7B (0-shot)	Accuracy	77.2	# 17
Common Sense Reasoning	WinoGrande	Mistral 7B (0-shot)	Accuracy	74.2	# 25

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mixtral-of-experts/common-sense-reasoning-on-arc-easy)](https://paperswithcode.com/sota/common-sense-reasoning-on-arc-easy?p=mixtral-of-experts)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mixtral-of-experts/question-answering-on-piqa)](https://paperswithcode.com/sota/question-answering-on-piqa?p=mixtral-of-experts)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mixtral-of-experts/common-sense-reasoning-on-winogrande)](https://paperswithcode.com/sota/common-sense-reasoning-on-winogrande?p=mixtral-of-experts)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mixtral-of-experts/multi-task-language-understanding-on-mmlu)](https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu?p=mixtral-of-experts)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mixtral-of-experts/code-generation-on-mbpp)](https://paperswithcode.com/sota/code-generation-on-mbpp?p=mixtral-of-experts)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mixtral-of-experts/code-generation-on-humaneval)](https://paperswithcode.com/sota/code-generation-on-humaneval?p=mixtral-of-experts)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mixtral-of-experts/math-word-problem-solving-on-math)](https://paperswithcode.com/sota/math-word-problem-solving-on-math?p=mixtral-of-experts)`

Mixtral of Experts

8 Jan 2024 · Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
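The routing step described in the abstract (a router picks two of the eight experts per token and combines their outputs) can be illustrated with a minimal sketch. This is not the released implementation: it assumes the router is a single linear map producing one logit per expert and that the gate weights are a softmax over the two selected logits. The names `softmax`, `top2_route`, `moe_layer`, and the toy experts are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits: Sequence[float]) -> List[Tuple[int, float]]:
    """Pick the two experts with the highest logits.

    Returns (expert_index, gate_weight) pairs; the two gate weights
    sum to 1 because the softmax is taken over the selected logits only
    (an assumption of this sketch).
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:2]
    gates = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, gates))


def moe_layer(x: Sequence[float],
              router_weights: Sequence[Sequence[float]],
              experts: Sequence[Expert]) -> Vector:
    """Sparse mixture-of-experts forward pass for one token.

    router_weights holds one row per expert; each logit is dot(row, x).
    Only the two selected experts are evaluated.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    out = [0.0] * len(x)
    for idx, gate in top2_route(logits):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


# Toy usage: 8 stand-in "experts" that just scale the input by 1..8.
toy_experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(8)]
toy_router = [[0.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [0.5, 0.0],
              [1.0, 0.0], [-3.0, 0.0], [0.0, 0.0], [0.2, 0.0]]
token_state = [1.0, 0.0]
mixed = moe_layer(token_state, toy_router, toy_experts)
```

Because only two experts run per token, the compute per token scales with the active parameters rather than with the total parameter count, which is the distinction the abstract draws between 47B and 13B.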

Code

hit-scir/chinese-mixtral-8x7b

617

ymcui/chinese-mixtral

502

consequentai/fneval

Tasks

Code Generation

Common Sense Reasoning

Language Modelling

Math Word Problem Solving

Multi-task Language Understanding

Question Answering

Datasets

Natural Questions

MMLU

GSM8K

TriviaQA

HumanEval

HellaSwag

MATH

PIQA

WinoGrande

The Pile

MBPP

ARC (AI2 Reasoning Challenge)

BBQ

Results from the Paper

Ranked #9 on Question Answering on PIQA

Methods

Adam • Attention Dropout • BASE • BPE • Cosine Annealing • Dense Connections • Dropout • Fixed Factorized Attention • GELU • GPT-3 • Layer Normalization • Linear Layer • Linear Warmup With Cosine Annealing • LLaMA • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Strided Attention • Weight Decay
