HaluEval is a large-scale hallucination evaluation benchmark designed for Large Language Models (LLMs). It provides a comprehensive collection of generated and human-annotated hallucinated samples to evaluate the performance of LLMs in recognizing hallucinations¹².

Here are the key details about the HaluEval dataset:

  1. Purpose and Overview:
    • Purpose: HaluEval aims to understand which types of content LLMs tend to hallucinate, and to what extent.
    • Content: It includes both general user queries with ChatGPT responses and task-specific examples from three tasks: question answering, knowledge-grounded dialogue, and text summarization.
  2. Data Sources:
    • For general user queries, HaluEval adopts the 52K instruction-tuning dataset from Alpaca.
    • Task-specific examples are generated from existing task datasets (e.g., HotpotQA, OpenDialKG, CNN/Daily Mail), which serve as seed data (see the generation sketch below).
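
To make the generation step concrete, here is a minimal sketch of producing a hallucinated counterpart for a QA seed example. It assumes the openai Python client (v1+); the prompt wording is illustrative, not the exact instruction used by the HaluEval authors.

```python
# Minimal sketch of hallucinated-sample generation, assuming the openai
# Python client (v1+) with OPENAI_API_KEY set in the environment. The
# prompt is illustrative, not the exact instruction used by the authors.
from openai import OpenAI

client = OpenAI()

def generate_hallucinated_answer(question: str, right_answer: str) -> str:
    """Ask the model for a fluent but factually wrong answer to a seed question."""
    prompt = (
        "You are given a question and its correct answer. Write a plausible, "
        "fluent answer that is factually WRONG.\n"
        f"Question: {question}\n"
        f"Correct answer: {right_answer}\n"
        "Hallucinated answer:"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```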
  3. Data Composition:
    • General user queries: 5,000 user queries paired with ChatGPT responses. Queries are selected whose sampled responses have low mutual similarity, since disagreement between responses flags potential hallucinations (see the filtering sketch after this list).
    • Task-specific examples: 30,000 examples drawn from three tasks:
      • Question answering: based on HotpotQA as seed data.
      • Knowledge-grounded dialogue: based on OpenDialKG as seed data.
      • Text summarization: based on CNN/Daily Mail as seed data.
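
The low-similarity selection in item 3 can be made concrete: sample several ChatGPT responses per query, measure how much they agree, and keep the queries whose responses diverge most. A minimal sketch, using difflib's lexical ratio from the standard library as a stand-in for whatever similarity metric the authors actually used:

```python
# Sketch of HaluEval-style low-similarity query selection. Queries whose
# sampled responses disagree the most are flagged as likely hallucination
# cases. difflib's ratio is a lexical stand-in for the paper's metric.
from difflib import SequenceMatcher
from itertools import combinations

def mean_pairwise_similarity(responses: list[str]) -> float:
    """Average similarity across all response pairs (1.0 if < 2 responses)."""
    if len(responses) < 2:
        return 1.0
    pairs = list(combinations(responses, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def select_low_similarity_queries(samples: dict[str, list[str]], k: int) -> list[str]:
    """samples maps each user query to several sampled ChatGPT responses;
    returns the k queries whose responses agree least with one another."""
    return sorted(samples, key=lambda q: mean_pairwise_similarity(samples[q]))[:k]
```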
  4. Data Release:
    • The dataset contains 35,000 generated and human-annotated hallucinated samples used in the experiments.
    • The released JSON files include:
      • qa_data.json: hallucinated QA samples.
      • dialogue_data.json: hallucinated dialogue samples.
      • summarization_data.json: hallucinated summarization samples.
      • general_data.json: human-annotated ChatGPT responses to general user queries.
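
For reference, here is a hedged loading sketch for the released files. It assumes they are JSON Lines (one object per line), with a fallback for a single JSON array, and inspects the first record's keys rather than assuming a fixed schema:

```python
# Loading sketch: the repository's files appear to be JSON Lines (one
# object per line); fall back to a single JSON array just in case.
import json

def load_halueval(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):  # whole file is one JSON array
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line]

samples = load_halueval("qa_data.json")
print(len(samples), sorted(samples[0].keys()))  # inspect the schema
```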

Sources:
  1. HaluEval: A Hallucination Evaluation Benchmark for LLMs. https://github.com/RUCAIBox/HaluEval
  2. jzjiao/halueval-sft · Datasets at Hugging Face. https://huggingface.co/datasets/jzjiao/halueval-sft
  3. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. ACL Anthology. https://aclanthology.org/2023.emnlp-main.397/
  4. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. arXiv. https://arxiv.org/abs/2305.11747

License


  • Unknown
