Language Model Evaluation

31 papers with code • 0 benchmarks • 0 datasets

The task of using LLMs as evaluators (judges) of other large language models and vision-language models.
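
A minimal sketch of the common LLM-as-a-judge setup, assuming the OpenAI Python client; the model name, rubric, and answer parsing below are illustrative placeholders rather than any particular paper's protocol.

```python
# Minimal LLM-as-a-judge sketch (assumes the OpenAI Python client; model name,
# rubric, and parsing are hypothetical placeholders, not any paper's method).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading another model's answer.
Question: {question}
Candidate answer: {answer}
Rate the answer from 1 (poor) to 5 (excellent) and reply with the number only."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask a judge LLM to score a candidate answer on a 1-5 scale."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

score = judge("What is the capital of France?", "Paris is the capital of France.")
print(score)  # e.g. 5
```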

Most implemented papers

BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing

bigscience-workshop/biomedical 30 Jun 2022

Training and evaluating language models increasingly requires the construction of meta-datasets: diverse collections of curated data with clear provenance.

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

opendfm/scieval 25 Aug 2023

This design suffers from the data-leakage problem and lacks evaluation of subjective Q/A ability.

MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation

dvlab-research/mr-gsm8k 28 Dec 2023

In this work, we introduce a novel evaluation paradigm for Large Language Models (LLMs) that compels them to transition from a traditional question-answering role, akin to a student, to a solution-scoring role, akin to a teacher.
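
An illustrative sketch of this solution-scoring paradigm; the item fields, planted error, and grading prompt are hypothetical and do not reflect the MR-GSM8K dataset's actual schema.

```python
# Illustrative "student becomes teacher" evaluation item: the evaluated model is
# asked to grade a candidate solution rather than solve the problem itself.
eval_item = {
    "question": "A farmer has 3 pens with 12 chickens each. He sells 9. How many remain?",
    "candidate_solution": (
        "Step 1: 3 pens * 12 chickens = 36 chickens.\n"
        "Step 2: 36 - 9 = 25 chickens remain."  # arithmetic slip planted on purpose
    ),
}

GRADER_PROMPT = (
    "You are a teacher grading a student's solution.\n"
    "Question: {question}\n"
    "Student solution:\n{candidate_solution}\n"
    "Is the solution correct? If not, name the first incorrect step and explain the error."
)

print(GRADER_PROMPT.format(**eval_item))
# The evaluated model is then scored on whether it flags Step 2 (36 - 9 = 27, not 25).
```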

Mind the Gap: Assessing Temporal Generalization in Neural Language Models

deepmind/deepmind-research NeurIPS 2021

Given the compilation of ever-larger language modelling datasets and the growing list of language-model-based NLP applications that require up-to-date factual knowledge about the world, we argue that now is the right time to rethink the static way in which we currently train and evaluate language models, and to develop adaptive language models that remain up-to-date with our ever-changing, non-stationary world.

ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning

zjunlp/SemEval2021Task4 SEMEVAL 2021

This paper presents our systems for the three subtasks of SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning (ReCAM).

PrOnto: Language Model Evaluations for 859 Languages

lgessler/pronto 22 May 2023

Evaluation datasets are critical resources for measuring the quality of pretrained language models.

C-STS: Conditional Semantic Textual Similarity

princeton-nlp/c-sts 24 May 2023

Semantic textual similarity (STS), a cornerstone task in NLP, measures the degree of similarity between a pair of sentences, and has broad application in fields such as information retrieval and natural language understanding.
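
A minimal sketch of scoring an STS pair with embedding cosine similarity, assuming the sentence-transformers package; the model name is a common default and not the C-STS paper's setup.

```python
# Score sentence-pair similarity with embedding cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

s1 = "A man is playing a guitar on stage."
s2 = "Someone performs music with a guitar."

emb1, emb2 = model.encode([s1, s2])
score = util.cos_sim(emb1, emb2).item()  # roughly in [-1, 1]; higher = more similar
print(f"similarity: {score:.3f}")
```

C-STS extends this setup by conditioning the similarity judgement on a natural-language condition (e.g. the aspect of the sentences being compared), so a single pair can receive different scores under different conditions.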

FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

kaistai/flask 20 Jul 2023

Evaluation of Large Language Models (LLMs) is challenging because instruction-following requires alignment with human values and because the required set of skills varies with the instruction.

AgentSims: An Open-Source Sandbox for Large Language Model Evaluation

py499372727/AgentSims 8 Aug 2023

With ChatGPT-like large language models (LLMs) prevailing in the community, how to evaluate the abilities of LLMs remains an open question.

Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation

liyucheng09/contamination_detector 19 Sep 2023

Data contamination in model evaluation is becoming increasingly prevalent, as the massive training corpora of large language models often unintentionally include benchmark samples.
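
A minimal sketch of a perplexity probe for contamination, assuming Hugging Face transformers with GPT-2; it illustrates the general idea of comparing perplexity on benchmark samples against comparable unseen text, not the paper's exact detection protocol or thresholds.

```python
# Perplexity of a benchmark sample under a causal LM; unusually low values
# relative to comparable unseen text can hint at memorisation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

benchmark_sample = "The quick brown fox jumps over the lazy dog."
print(f"perplexity: {perplexity(benchmark_sample):.2f}")
```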