TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space

27 Feb 2024 · Shaolei Zhang, Tian Yu, Yang Feng

Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, they sometimes produce hallucinations, in particular generating untruthful responses even when they possess the correct knowledge. In this paper, we propose TruthX, an inference-time method that elicits the truthfulness of LLMs by editing their internal representations in a truthful space. TruthX employs an auto-encoder to map an LLM's representations into semantic and truthful latent spaces, respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. During inference, TruthX edits the LLM's internal representations in the truthful space, effectively enhancing the truthfulness of the LLM. Experiments show that TruthX improves the truthfulness of 13 advanced LLMs by an average of 20% on the TruthfulQA benchmark. Further analyses suggest that the truthful space acquired by TruthX plays a pivotal role in controlling whether the LLM produces truthful or hallucinatory responses.
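The abstract describes an inference-time edit: hidden states are projected into a truthful latent space, shifted along a learned direction, and the shift is mapped back onto the LLM's representations. The following PyTorch sketch illustrates that idea under stated assumptions; the module names, dimensions, and the simple residual edit rule are illustrative and do not reproduce the authors' released implementation or training procedure (e.g., how contrastive learning derives the direction).

```python
# Minimal sketch of latent-space editing of LLM hidden states.
# All module names, sizes, and the edit rule are assumptions for illustration,
# not the authors' actual TruthX implementation.
import torch
import torch.nn as nn


class TruthfulEditor(nn.Module):
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        # Shared encoder plus two projection heads: one for semantic content,
        # one for the "truthful" latent space (assumed architecture).
        self.encoder = nn.Linear(hidden_dim, latent_dim)
        self.semantic_head = nn.Linear(latent_dim, latent_dim)
        self.truthful_head = nn.Linear(latent_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, hidden_dim)
        # Editing direction in the truthful space; in the paper this is
        # identified via contrastive learning on truthful vs. hallucinated samples.
        self.truthful_direction = nn.Parameter(torch.randn(latent_dim))

    def edit(self, hidden_states: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
        """Shift hidden states along the truthful direction at inference time."""
        z = self.encoder(hidden_states)
        z_truth = self.truthful_head(z)
        direction = self.truthful_direction / self.truthful_direction.norm()
        z_edited = z_truth + strength * direction        # move toward "truthful"
        # Apply the edit as a residual on the original hidden states.
        delta = self.decoder(z_edited) - self.decoder(z_truth)
        return hidden_states + delta


# Usage: edit a batch of hidden states taken from one transformer layer.
editor = TruthfulEditor(hidden_dim=4096, latent_dim=1024)
h = torch.randn(2, 16, 4096)            # (batch, seq_len, hidden_dim)
h_edited = editor.edit(h, strength=2.0)
print(h_edited.shape)                   # torch.Size([2, 16, 4096])
```

In practice such an editor would be hooked into selected attention and feed-forward layers of the LLM during generation, with the editing strength controlling the trade-off between truthfulness and generative quality.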


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Question Answering | TruthfulQA | Mistral-7B-Instruct-v0.2 + TruthX | MC1 | 0.56 | # 2 |
| Question Answering | TruthfulQA | Mistral-7B-Instruct-v0.2 + TruthX | MC2 | 0.75 | # 1 |
| Question Answering | TruthfulQA | LLaMa-2-7B-Chat + TruthX | MC1 | 0.54 | # 3 |
| Question Answering | TruthfulQA | LLaMa-2-7B-Chat + TruthX | MC2 | 0.74 | # 2 |
