NLG Evaluation
29 papers with code • 0 benchmarks • 0 datasets
Evaluating text generated by NLG (Natural Language Generation) systems, such as large language models.
Most implemented papers
NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist
Our proposed framework makes it possible (i) to verify whether automatic metrics are faithful to human preference, regardless of their level of correlation with humans; and (ii) to inspect the strengths and limitations of NLG systems via pairwise evaluation.
Towards a Unified Multi-Dimensional Evaluator for Text Generation
We re-frame NLG evaluation as a Boolean Question Answering (QA) task: by guiding the model with different questions, a single evaluator can assess text along multiple dimensions.
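A minimal sketch of this Boolean-QA framing is shown below. It uses a generic instruction-tuned T5 model (google/flan-t5-base) as a stand-in for the released UniEval checkpoints, and the question templates are illustrative rather than the paper's exact prompts: the evaluator compares the probabilities of "Yes" and "No" as the first decoded token, and switching the question switches the evaluated dimension.

```python
# Sketch of Boolean-QA-style NLG evaluation with a generic seq2seq model.
# Model choice and question wording are illustrative, not the UniEval setup.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "google/flan-t5-base"  # stand-in for the released evaluator checkpoints
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def yes_probability(prompt: str) -> float:
    """Return P('Yes') / (P('Yes') + P('No')) for the first decoded token."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

# One evaluator, many dimensions: only the question changes.
summary = "The cat sat on the mat and then fell asleep in the sun."
questions = {
    "fluency":   f"Is the following sentence fluent and grammatical? Answer Yes or No. Sentence: {summary}",
    "coherence": f"Is the following summary coherent? Answer Yes or No. Summary: {summary}",
}
for dimension, question in questions.items():
    print(dimension, round(yes_probability(question), 3))
```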
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
In this work, we present G-Eval, a framework that uses large language models with chain-of-thought (CoT) prompting and a form-filling paradigm to assess the quality of NLG outputs.
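The sketch below illustrates the general shape of such an LLM-based evaluator: the prompt states the criterion, gives numbered evaluation steps (the CoT), and ends with a form to fill in with a score. The prompt wording, criterion, and model name are illustrative placeholders, not the paper's released prompts; G-Eval additionally weights scores by the LLM's token probabilities, which is omitted here.

```python
# G-Eval-style sketch: criterion + evaluation steps + form-filling prompt,
# sent to a chat LLM. Prompt and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """You will be given one summary written for a news article.
Your task is to rate the summary on one metric.

Evaluation criteria:
Coherence (1-5) - the collective quality of all sentences.

Evaluation steps:
1. Read the article and identify its main topic and key points.
2. Check whether the summary presents them in a clear, logical order.
3. Assign a coherence score from 1 to 5.

Source text:
{source}

Summary:
{summary}

Evaluation form (scores ONLY):
- Coherence:"""

def llm_coherence_score(source: str, summary: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(source=source, summary=summary)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```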
Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References
Most research on natural language generation (NLG) relies on evaluation benchmarks with a limited number of references per sample, which may result in poor correlation with human judgements.
Are LLM-based Evaluators Confusing NLG Quality Criteria?
Some prior work has shown that LLMs perform well in NLG evaluation for different tasks.
Why We Need New Evaluation Metrics for NLG
The majority of NLG evaluation relies on automatic metrics, such as BLEU.
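For reference, the snippet below computes one such metric, corpus-level BLEU, with the sacreBLEU library on toy data; supplying several reference sets per hypothesis is also one way the reference-diversification idea above can be plugged into existing metrics.

```python
# Corpus-level BLEU with sacreBLEU on toy data.
import sacrebleu

hypotheses = ["the cat sat on the mat", "there is a dog in the park"]
# sacreBLEU accepts multiple reference sets: one list of strings per reference set,
# aligned with the hypotheses.
references = [
    ["the cat is sitting on the mat", "a dog is in the park"],       # reference set 1
    ["the cat sat on a mat",          "there is a dog at the park"], # reference set 2
]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```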
A Study of Automatic Metrics for the Evaluation of Natural Language Explanations
As transparency becomes key for robotics and AI, it will be necessary to evaluate the methods through which transparency is provided, including automatically generated natural language (NL) explanations.
Perturbation CheckLists for Evaluating NLG Evaluation Metrics
Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall quality, etc.
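A perturbation check of this kind can be sketched in a few lines: apply a criterion-specific perturbation (here, shuffling word order to damage fluency) and verify that the metric's score drops. The perturbation and the choice of sentence-level BLEU are illustrative, not the paper's templates.

```python
# Minimal perturbation check: a metric sensitive to fluency should score a
# word-shuffled (disfluent) output lower than the original hypothesis.
import random
import sacrebleu

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"

def shuffle_words(text: str, seed: int = 0) -> str:
    """Fluency-damaging perturbation: randomly permute the word order."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

perturbed = shuffle_words(hypothesis)

original_score = sacrebleu.sentence_bleu(hypothesis, [reference]).score
perturbed_score = sacrebleu.sentence_bleu(perturbed, [reference]).score

print(f"original:  {original_score:.2f}")
print(f"perturbed: {perturbed_score:.2f}")
# The metric passes this particular check if perturbed_score < original_score.
```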
Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation
Based on the nature of information change from input to output, we classify NLG tasks into compression (e.g., summarization), transduction (e.g., text rewriting), and creation (e.g., dialog).
Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons
In this work, we introduce Active Evaluation, a framework to efficiently identify the top-ranked system by actively choosing system pairs for comparison using dueling bandit algorithms.
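The toy sketch below conveys the core idea of spending annotation budget on the least-decided comparisons; the simulated annotator and the simple uncertainty heuristic are stand-ins for the dueling-bandit algorithms analysed in the paper.

```python
# Toy sketch of active pairwise evaluation: repeatedly pick the system pair
# whose head-to-head outcome is least decided, request one pairwise judgement,
# and update win counts. The simulated annotator and the uncertainty heuristic
# are illustrative stand-ins, not the paper's dueling-bandit algorithms.
import itertools
import math
import random

random.seed(0)
systems = ["A", "B", "C", "D"]
true_quality = {"A": 0.75, "B": 0.60, "C": 0.55, "D": 0.40}  # hidden ground truth

wins = {(i, j): 0 for i, j in itertools.permutations(systems, 2)}

def simulated_judgement(i, j):
    """Stand-in for a human annotator: prefers i in proportion to the quality gap."""
    p_i = 0.5 + (true_quality[i] - true_quality[j]) / 2
    return (i, j) if random.random() < p_i else (j, i)

def least_decided_pair():
    """Pick the pair whose win rate is closest to 0.5 relative to its sample size."""
    def undecidedness(pair):
        i, j = pair
        n = wins[(i, j)] + wins[(j, i)]
        rate = wins[(i, j)] / n if n else 0.5
        return abs(rate - 0.5) - 1.0 / math.sqrt(n + 1)  # smaller = less decided
    return min(itertools.combinations(systems, 2), key=undecidedness)

for _ in range(200):  # each iteration costs one human pairwise comparison
    winner, loser = simulated_judgement(*least_decided_pair())
    wins[(winner, loser)] += 1

total_wins = {s: sum(wins[(s, t)] for t in systems if t != s) for s in systems}
print("estimated top system:", max(total_wins, key=total_wins.get))
```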