MMLU

152 papers with code • 3 benchmarks • 1 dataset


Most implemented papers

Scaling Instruction-Finetuned Language Models

google-research/flan 20 Oct 2022

We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation).

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

thudm/chatglm-6b 18 Jun 2024

We introduce ChatGLM, an evolving family of large language models that we have been developing over time.

Qwen2 Technical Report

qwenlm/qwen2 15 Jul 2024

This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models.

tinyBenchmarks: evaluating LLMs with fewer examples

felipemaiapolo/tinybenchmarks 22 Feb 2024

The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language models' abilities.
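tinyBenchmarks itself builds curated small subsets; as a toy illustration of the underlying idea (not the paper's method), estimating full-benchmark accuracy from a random subsample comes with a quantifiable binomial error bar. All names below are hypothetical.

```python
import random

def estimate_accuracy(results, sample_size, seed=0):
    """Estimate full-benchmark accuracy from a random subsample.

    results: list of 0/1 correctness indicators, one per question.
    Returns (point_estimate, standard_error).
    """
    rng = random.Random(seed)
    sample = rng.sample(results, sample_size)
    p = sum(sample) / sample_size
    se = (p * (1 - p) / sample_size) ** 0.5  # binomial standard error
    return p, se

# Toy run: 1,000 questions with true accuracy 0.7, estimated from 100.
full = [1] * 700 + [0] * 300
p, se = estimate_accuracy(full, 100)
```

With 100 examples the standard error is at most 0.05, which is why small, well-chosen subsets can rank models almost as reliably as the full benchmark.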

REPLUG: Retrieval-Augmented Black-Box Language Models

ruc-nlpir/flashrag 30 Jan 2023

We introduce REPLUG, a retrieval-augmented language modeling framework that treats the language model (LM) as a black box and augments it with a tuneable retrieval model.
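The black-box setup can be sketched as: retrieve top-k documents, build one augmented prompt per document, and weight the frozen LM's outputs by retrieval score. The sketch below uses a toy bag-of-words retriever and only returns the prompts and weights; it is an assumption-laden illustration, not the paper's implementation.

```python
from collections import Counter

def bow_similarity(a, b):
    """Bag-of-words overlap score between two texts (toy retriever)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    return sum((ca & cb).values())

def replug_prompts(query, corpus, k=2):
    """Retrieve top-k documents and build one augmented prompt per document.

    In REPLUG each augmented prompt is sent to the frozen LM separately
    and the output distributions are ensembled, weighted by retrieval
    scores; here we return the prompts and normalized weights.
    """
    scored = sorted(corpus, key=lambda d: bow_similarity(query, d), reverse=True)[:k]
    weights = [bow_similarity(query, d) for d in scored]
    total = sum(weights) or 1
    prompts = [f"{doc}\n\n{query}" for doc in scored]
    return prompts, [w / total for w in weights]

docs = [
    "MMLU covers 57 subjects from STEM to law.",
    "Cats sleep for most of the day.",
]
prompts, weights = replug_prompts("How many subjects does MMLU cover?", docs, k=2)
```

Because the LM is only queried, not updated, the retriever is the sole trainable component, which is what makes the framework applicable to API-only models.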

Make Your LLM Fully Utilize the Context

microsoft/FILM 25 Apr 2024

While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge.
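The lost-in-the-middle effect is typically measured with position probes: the same key fact is inserted at different relative depths of a long filler context and retrieval accuracy is compared across depths. A minimal probe builder, with all names hypothetical:

```python
def build_probe(filler_sentences, fact, depth):
    """Insert a key fact at a relative depth (0.0 = start, 1.0 = end)
    of a long filler context, for a lost-in-the-middle probe."""
    pos = int(depth * len(filler_sentences))
    ctx = filler_sentences[:pos] + [fact] + filler_sentences[pos:]
    return " ".join(ctx)

filler = [f"Sentence {i} is filler." for i in range(100)]
fact = "The passcode is 4217."
middle_probe = build_probe(filler, fact, 0.5)
```

Sweeping `depth` from 0.0 to 1.0 and asking the model for the fact at each depth yields the characteristic U-shaped accuracy curve the snippet refers to.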

Are We Done with MMLU?

aryopg/mmlu-redux 6 Jun 2024

Auditing MMLU for ground-truth errors, we find, for example, that 57% of the analysed questions in the Virology subset contain errors.

DataComp-LM: In search of the next generation of training sets for language models

facebookresearch/lingua 17 Jun 2024

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models.

Training Compute-Optimal Large Language Models

karpathy/llama2.c 29 Mar 2022

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget.
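Using the common approximation that training costs C ≈ 6·N·D FLOPs (N parameters, D tokens), the paper's finding that N and D should scale together is often summarized as the ~20 tokens-per-parameter rule of thumb. A sketch under that assumption:

```python
def chinchilla_allocation(flops_budget, tokens_per_param=20.0):
    """Rough compute-optimal split assuming C ~= 6 * N * D and
    D ~= 20 * N (tokens-per-parameter rule of thumb)."""
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself: ~70B params on ~1.4T tokens ⇒ C ≈ 5.88e23 FLOPs.
n, d = chinchilla_allocation(5.88e23)
```

Plugging Chinchilla's compute budget back in recovers roughly its 70B-parameter, 1.4T-token configuration, illustrating that both N and D grow as the square root of compute.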

UL2: Unifying Language Learning Paradigms

google-research/google-research 10 May 2022

Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.