HaluEval is a large-scale hallucination evaluation benchmark designed for Large Language Models (LLMs). It provides a comprehensive collection of generated and human-annotated hallucinated samples to evaluate the performance of LLMs in recognizing hallucinations¹².

Here are the key details about the HaluEval dataset:

  1. Purpose and Overview:
    • Purpose: HaluEval aims to understand which types of content LLMs tend to hallucinate, and to what extent.
    • Content: It includes both general user queries with ChatGPT responses and task-specific examples from three tasks: question answering, knowledge-grounded dialogue, and text summarization.
  2. Data Sources:
    • For general user queries, HaluEval adopts the 52K instruction-tuning dataset from Alpaca.
    • Task-specific examples are generated from existing task datasets (e.g., HotpotQA, OpenDialKG, CNN/Daily Mail), which serve as seed data (see the generation sketch below).
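
To make the generation step concrete, here is a minimal sketch of producing a hallucinated counterpart for a QA seed example. It assumes the openai Python client (v1+); the prompt wording is illustrative, not the exact instruction used by the HaluEval authors.

```python
# Minimal sketch of hallucinated-sample generation, assuming the openai
# Python client (v1+) with OPENAI_API_KEY set in the environment. The
# prompt is illustrative, not the exact instruction used by the authors.
from openai import OpenAI

client = OpenAI()

def generate_hallucinated_answer(question: str, right_answer: str) -> str:
    """Ask the model for a fluent but factually wrong answer to a seed question."""
    prompt = (
        "You are given a question and its correct answer. Write a plausible, "
        "fluent answer that is factually WRONG.\n"
        f"Question: {question}\n"
        f"Correct answer: {right_answer}\n"
        "Hallucinated answer:"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```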
  3. Data Composition:
    • General user queries: 5,000 user queries paired with ChatGPT responses. Queries are selected whose sampled responses have low mutual similarity, since disagreement between responses flags potential hallucinations (see the filtering sketch after this list).
    • Task-specific examples: 30,000 examples drawn from three tasks:
      • Question answering: based on HotpotQA as seed data.
      • Knowledge-grounded dialogue: based on OpenDialKG as seed data.
      • Text summarization: based on CNN/Daily Mail as seed data.
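
The low-similarity selection in item 3 can be made concrete: sample several ChatGPT responses per query, measure how much they agree, and keep the queries whose responses diverge most. A minimal sketch, using difflib's lexical ratio from the standard library as a stand-in for whatever similarity metric the authors actually used:

```python
# Sketch of HaluEval-style low-similarity query selection. Queries whose
# sampled responses disagree the most are flagged as likely hallucination
# cases. difflib's ratio is a lexical stand-in for the paper's metric.
from difflib import SequenceMatcher
from itertools import combinations

def mean_pairwise_similarity(responses: list[str]) -> float:
    """Average similarity across all response pairs (1.0 if < 2 responses)."""
    if len(responses) < 2:
        return 1.0
    pairs = list(combinations(responses, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def select_low_similarity_queries(samples: dict[str, list[str]], k: int) -> list[str]:
    """samples maps each user query to several sampled ChatGPT responses;
    returns the k queries whose responses agree least with one another."""
    return sorted(samples, key=lambda q: mean_pairwise_similarity(samples[q]))[:k]
```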
  4. Data Release:
    • The dataset contains 35,000 generated and human-annotated hallucinated samples used in the experiments.
    • The released JSON files include:
      • qa_data.json: hallucinated QA samples.
      • dialogue_data.json: hallucinated dialogue samples.
      • summarization_data.json: hallucinated summarization samples.
      • general_data.json: human-annotated ChatGPT responses to general user queries.
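
For reference, here is a hedged loading sketch for the released files. It assumes they are JSON Lines (one object per line), with a fallback for a single JSON array, and inspects the first record's keys rather than assuming a fixed schema:

```python
# Loading sketch: the repository's files appear to be JSON Lines (one
# object per line); fall back to a single JSON array just in case.
import json

def load_halueval(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):  # whole file is one JSON array
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line]

samples = load_halueval("qa_data.json")
print(len(samples), sorted(samples[0].keys()))  # inspect the schema
```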

Sources:
  1. HaluEval: A Hallucination Evaluation Benchmark for LLMs. https://github.com/RUCAIBox/HaluEval
  2. jzjiao/halueval-sft · Datasets at Hugging Face. https://huggingface.co/datasets/jzjiao/halueval-sft
  3. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. ACL Anthology. https://aclanthology.org/2023.emnlp-main.397/
  4. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. arXiv. https://arxiv.org/abs/2305.11747

License


  • Unknown
