Language Model Evaluation
31 papers with code • 0 benchmarks • 0 datasets
The task of using LLMs as evaluators of large language models and vision-language models.
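A common realization of this setup is the LLM-as-a-judge pattern: the candidate model's response is embedded in a grading prompt and an evaluator LLM returns a score. The sketch below is a minimal, provider-agnostic illustration of that pattern; the `call_judge` callable and the 1-10 rubric are assumptions for illustration, not any specific paper's protocol.

```python
import re
from typing import Callable

# Illustrative grading prompt; wording and scale are assumptions, not a fixed standard.
JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's response to the
user instruction on a scale of 1-10 for helpfulness and correctness.
Reply with a line of the form "Score: <number>" followed by a short justification.

[Instruction]
{instruction}

[Assistant response]
{response}
"""

def judge_response(instruction: str, response: str,
                   call_judge: Callable[[str], str]) -> int:
    """Score one response with an evaluator LLM.

    `call_judge` is any function that sends a prompt string to the judge model
    and returns its raw text completion (an API client, a local model, ...).
    """
    reply = call_judge(JUDGE_PROMPT.format(instruction=instruction, response=response))
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        raise ValueError(f"Judge reply had no parsable score: {reply!r}")
    return int(match.group(1))
```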
Most implemented papers
BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing
Training and evaluating language models increasingly requires the construction of meta-datasets: diverse collections of curated data with clear provenance.
SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research
This benchmark design suffers from data leakage and lacks evaluation of subjective Q/A ability.
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation
In this work, we introduce a novel evaluation paradigm for Large Language Models (LLMs) that compels them to transition from a traditional question-answering role, akin to a student, to a solution-scoring role, akin to a teacher.
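In this "teacher" role, the evaluated model is shown a question together with a candidate solution and must judge whether the solution is correct and, if not, where it first goes wrong. The snippet below sketches one way to build such a grading prompt and parse the verdict; the prompt wording and output format are illustrative assumptions, not the benchmark's exact protocol.

```python
# Illustrative solution-grading prompt; fill with .format(question=..., solution=...)
# and send it through any model-call function such as the call_judge stub above.
GRADING_PROMPT = """You are grading a student's solution.

Question:
{question}

Student solution (steps are numbered):
{solution}

First decide whether the final answer is correct. If it is wrong, name the number
of the first incorrect step. Answer on two lines:
Verdict: correct | incorrect
First error step: <number or "none">
"""

def parse_grading_reply(reply: str) -> dict:
    """Extract the verdict and first-error step from the model's reply."""
    verdict = "incorrect" if "Verdict: incorrect" in reply else "correct"
    step = None
    for line in reply.splitlines():
        if line.lower().startswith("first error step:"):
            value = line.split(":", 1)[1].strip()
            step = None if value.lower() == "none" else int(value)
    return {"verdict": verdict, "first_error_step": step}
```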
Mind the Gap: Assessing Temporal Generalization in Neural Language Models
Given the compilation of ever-larger language modelling datasets and the growing list of language-model-based NLP applications that require up-to-date factual knowledge, we argue that now is the right time to rethink the static way in which we currently train and evaluate language models, and to develop adaptive language models that remain up-to-date with our ever-changing, non-stationary world.
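Operationally, temporal generalization is usually measured by scoring held-out documents bucketed by publication date and checking whether quality degrades on text written after the model's training cutoff. A minimal sketch, assuming you already have a per-document perplexity function and timestamped test documents:

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Callable, Iterable, Tuple

def perplexity_by_year(
    docs: Iterable[Tuple[date, str]],        # (publication date, text) pairs
    perplexity: Callable[[str], float],      # any per-document perplexity scorer
) -> dict:
    """Group test documents by year and report mean perplexity per year.

    A model with poor temporal generalization shows rising perplexity for years
    after its training-data cutoff relative to years it was trained on.
    """
    buckets = defaultdict(list)
    for published, text in docs:
        buckets[published.year].append(perplexity(text))
    return {year: mean(scores) for year, scores in sorted(buckets.items())}
```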
ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning
This paper presents our systems for the three subtasks of SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning (ReCAM).
PrOnto: Language Model Evaluations for 859 Languages
Evaluation datasets are critical resources for measuring the quality of pretrained language models.
C-STS: Conditional Semantic Textual Similarity
Semantic textual similarity (STS), a cornerstone task in NLP, measures the degree of similarity between a pair of sentences, and has broad application in fields such as information retrieval and natural language understanding.
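As a baseline, plain STS is typically scored by embedding the two sentences and taking cosine similarity; C-STS additionally supplies a free-form condition (e.g. similarity with respect to color) under which the pair should be judged. The sketch below shows only the unconditional baseline with an off-the-shelf sentence-transformers model; folding the condition into the input text, as in the trailing comment, is a naive illustration rather than the paper's method.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small off-the-shelf embedding model

def sts_score(sentence_a: str, sentence_b: str) -> float:
    """Unconditional STS: cosine similarity of the two sentence embeddings."""
    emb_a, emb_b = model.encode([sentence_a, sentence_b])
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

# Crude conditional variant (illustrative only): fold the condition into each input,
# e.g. sts_score(f"{condition}: {sentence_a}", f"{condition}: {sentence_b}").
print(sts_score("A red car parked outside.", "A crimson vehicle on the street."))
```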
FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
Evaluation of Large Language Models (LLMs) is challenging because instruction-following necessitates alignment with human values and the required set of skills varies depending on the instruction.
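FLASK decomposes a single overall judgment into per-skill scores, so the evaluator returns a rubric rather than one number. The sketch below builds such a rubric prompt and parses JSON scores; the skill names here are illustrative placeholders rather than FLASK's exact skill taxonomy, and `call_judge` is the same kind of model-call stub as in the earlier sketches.

```python
import json
from typing import Callable, Dict

# Illustrative skill set; FLASK defines its own fine-grained taxonomy.
SKILLS = ["logical correctness", "factuality", "completeness", "readability"]

RUBRIC_PROMPT = """Rate the assistant response to the instruction on each skill
from 1 (poor) to 5 (excellent). Reply with a JSON object whose keys are exactly:
{skills}

[Instruction]
{instruction}

[Assistant response]
{response}
"""

def skill_scores(instruction: str, response: str,
                 call_judge: Callable[[str], str]) -> Dict[str, int]:
    """Return a per-skill score dict parsed from the judge model's JSON reply."""
    prompt = RUBRIC_PROMPT.format(skills=json.dumps(SKILLS),
                                  instruction=instruction, response=response)
    scores = json.loads(call_judge(prompt))
    return {skill: int(scores[skill]) for skill in SKILLS}
```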
AgentSims: An Open-Source Sandbox for Large Language Model Evaluation
With ChatGPT-like large language models (LLMs) prevailing in the community, how to evaluate their abilities remains an open question.
Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation
Data contamination in model evaluation is becoming increasingly prevalent, as the massive training corpora of large language models often unintentionally include benchmark samples.
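The underlying idea is that a model assigns suspiciously low perplexity to benchmark text it memorised during training, so comparing perplexity on benchmark samples against perplexity on comparable unseen text gives a contamination signal. Below is a minimal sketch of per-sample perplexity with a Hugging Face causal LM (GPT-2 as a stand-in); it illustrates the measurement, not the paper's exact estimator.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under the model (exp of mean token negative log-likelihood)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss          # mean cross-entropy over tokens
    return float(torch.exp(loss))

# Example comparison (texts are made up for illustration): a much lower perplexity on
# benchmark samples than on matched fresh text hints at memorisation.
benchmark_ppl = perplexity("Question: What is the capital of France? Answer: Paris.")
fresh_ppl = perplexity("Question: What is the capital of Zubrowka? Answer: It is fictional.")
print(benchmark_ppl, fresh_ppl)
```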