MMLU
152 papers with code • 3 benchmarks • 1 dataset
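MMLU is a four-way multiple-choice benchmark spanning 57 subjects. Below is a minimal sketch of how a single item is commonly rendered into an evaluation prompt, assuming the Hugging Face "cais/mmlu" mirror and its question/choices/answer schema; other copies of the dataset may use different field names.

```python
# Minimal sketch: load one MMLU subject and format an item as a
# multiple-choice prompt. Assumes the "cais/mmlu" Hugging Face mirror.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")

def format_question(item):
    """Render one item as a standard 4-way multiple-choice prompt."""
    letters = ["A", "B", "C", "D"]
    lines = [item["question"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, item["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

item = mmlu[0]
print(format_question(item))
print("Gold:", ["A", "B", "C", "D"][item["answer"]])  # answer is a 0-3 index
```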
Most implemented papers
Scaling Instruction-Finetuned Language Models
We find that instruction finetuning, in particular scaling the number of tasks, scaling the model size, and finetuning on chain-of-thought data, dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation).
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
We introduce ChatGLM, an evolving family of large language models that we have been developing over time.
Qwen2 Technical Report
This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models.
tinyBenchmarks: evaluating LLMs with fewer examples
The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language models' abilities.
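tinyBenchmarks itself selects a small set of anchor examples via item response theory; as a naive baseline for the same idea, the sketch below estimates full-benchmark accuracy from a uniform random subsample with a binomial confidence interval. All names and numbers are illustrative, not from the paper.

```python
# Toy sketch of few-example evaluation: estimate full-benchmark accuracy from
# a small random subsample. tinyBenchmarks uses curated IRT-based anchor
# points instead of uniform sampling; this is only the naive baseline.
import math
import random

def estimate_accuracy(per_item_correct, n_samples=100, seed=0):
    """Estimate benchmark accuracy from n_samples randomly drawn items."""
    rng = random.Random(seed)
    sample = rng.sample(per_item_correct, n_samples)
    p = sum(sample) / n_samples
    stderr = math.sqrt(p * (1 - p) / n_samples)  # binomial standard error
    return p, 1.96 * stderr                      # estimate and ~95% CI half-width

# Example: 14,000 hypothetical 0/1 correctness scores for a full benchmark.
full = [1 if random.random() < 0.65 else 0 for _ in range(14_000)]
acc, ci = estimate_accuracy(full)
print(f"estimated accuracy {acc:.3f} +/- {ci:.3f}")
```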
REPLUG: Retrieval-Augmented Black-Box Language Models
We introduce REPLUG, a retrieval-augmented language modeling framework that treats the language model (LM) as a black box and augments it with a tuneable retrieval model.
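A minimal sketch of the REPLUG-style ensemble described above: prepend each retrieved document to the query, score the continuation with the frozen LM, and mix the per-document likelihoods using softmax-normalized retrieval scores. Here `retrieve` and `lm_logprob` are hypothetical stand-ins for a real retriever and a logprob-returning LM API.

```python
# Sketch of black-box retrieval augmentation: the LM is never finetuned;
# retrieved documents are simply prepended and the outputs are ensembled.
import math

def retrieve(query, k=4):
    """Hypothetical stand-in: return [(document_text, similarity_score), ...]."""
    raise NotImplementedError

def lm_logprob(prompt, continuation):
    """Hypothetical stand-in: log p(continuation | prompt) from a black-box LM."""
    raise NotImplementedError

def replug_logprob(query, continuation, k=4):
    """Mixture of the frozen LM's predictions, one per retrieved document."""
    docs = retrieve(query, k)
    # Softmax-normalize retrieval scores into ensemble weights.
    z = sum(math.exp(score) for _, score in docs)
    weights = [math.exp(score) / z for _, score in docs]
    # p(continuation | document + query) for each document, then a weighted sum.
    probs = [math.exp(lm_logprob(doc + "\n\n" + query, continuation)) for doc, _ in docs]
    return math.log(sum(w * p for w, p in zip(weights, probs)))
```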
Make Your LLM Fully Utilize the Context
While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge.
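A toy probe for this effect, assuming a hypothetical `ask_model` completion API: plant a known fact at varying relative depths of a long filler context and check whether the model recalls it.

```python
# Toy lost-in-the-middle probe: recall of a planted fact vs. its depth in
# a long distractor context. `ask_model` is a hypothetical stand-in for any
# chat/completion API.
def ask_model(prompt: str) -> str:
    """Hypothetical stand-in: return the model's answer to a prompt."""
    raise NotImplementedError

FILLER = "The sky was clear that day. " * 2000  # long distractor text
NEEDLE = "The secret code is 7481. "

def recall_at_depth(depth: float) -> bool:
    """Insert the needle at a relative depth in [0, 1] and test recall."""
    cut = int(len(FILLER) * depth)
    context = FILLER[:cut] + NEEDLE + FILLER[cut:]
    answer = ask_model(context + "\nWhat is the secret code?")
    return "7481" in answer

# Sweep depths once ask_model is wired to a real API:
# for d in (0.0, 0.25, 0.5, 0.75, 1.0):
#     print(d, recall_at_depth(d))
```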
Are We Done with MMLU?
The paper audits MMLU and finds widespread ground-truth errors; for example, 57% of the analysed questions in the Virology subset contain errors.
DataComp-LM: In search of the next generation of training sets for language models
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models.
Training Compute-Optimal Large Language Models
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget.
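The commonly cited rule of thumb from this line of work is that training FLOPs scale as roughly 6·N·D for N parameters and D tokens, with a compute-optimal ratio of about 20 tokens per parameter. The sketch below turns that approximation into a back-of-the-envelope calculator; the constants are the usual approximations, not the paper's fitted scaling-law coefficients.

```python
# Back-of-the-envelope compute-optimal sizing, assuming C ~ 6 * N * D and
# ~20 training tokens per parameter (approximate, commonly cited values).
import math

def compute_optimal(flops_budget):
    """Return (params, tokens) spending the budget at ~20 tokens/param."""
    ratio = 20.0                             # tokens per parameter
    n = math.sqrt(flops_budget / (6 * ratio))
    return n, ratio * n

# Chinchilla's reported budget of ~5.76e23 FLOPs recovers roughly its
# published configuration (70B parameters, 1.4T tokens).
n, d = compute_optimal(5.76e23)
print(f"params ~{n:.2e}, tokens ~{d:.2e}")
```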
UL2: Unifying Language Learning Paradigms
Our model also achieves strong results in in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.