Quantitative Certification of Bias in Large Language Models

29 May 2024  ·  Isha Chaudhary, Qian Hu, Manoj Kumar, Morteza Ziyadi, Rahul Gupta, Gagandeep Singh

Large Language Models (LLMs) can produce biased responses that cause representational harms. However, conventional bias studies are insufficient to thoroughly evaluate LLMs, as they cannot scale to large numbers of inputs and provide no guarantees. We therefore propose QuaCer-B, the first framework that certifies LLMs for bias over distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of prompts mentioning various demographic groups, sampled from a distribution. We illustrate bias certification for distributions of prompts created by applying varying prefixes, drawn from a prefix distribution, to a given set of prompts. We consider prefix distributions over random token sequences, mixtures of manual jailbreaks, and jailbreaks in the LLM's embedding space. We obtain non-trivial certified bounds on the probability of unbiased responses from state-of-the-art LLMs, exposing their vulnerabilities over distributions of prompts generated from computationally inexpensive distributions of prefixes.
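At its core, such a certificate is a high-confidence bound on a binomial proportion estimated from sampled prompts. The sketch below illustrates one standard way to obtain such bounds (exact Clopper-Pearson intervals); it is a minimal illustration, not the paper's exact construction. The functions `sample_prefix`, `query_llm`, and `is_unbiased` are hypothetical placeholders for a prefix sampler, the model under test, and a bias detector.

```python
from scipy.stats import beta  # Beta quantiles give exact binomial bounds


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided (1 - alpha) Clopper-Pearson bounds on a binomial
    proportion, given k successes in n independent trials."""
    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lo, hi


def certify(base_prompt, sample_prefix, query_llm, is_unbiased,
            n: int = 500, alpha: float = 0.05):
    """Sample n prompts from the distribution induced by prepending random
    prefixes to base_prompt, then bound P(unbiased response) with
    confidence 1 - alpha. All three callables are assumed placeholders."""
    successes = sum(
        is_unbiased(query_llm(sample_prefix() + base_prompt))
        for _ in range(n)
    )
    return clopper_pearson(successes, n, alpha)
```

In this sketch, `sample_prefix` would draw from one of the prefix distributions the abstract mentions (random token sequences, mixtures of manual jailbreaks, or embedding-space jailbreaks), and the returned interval is the certificate: a high-confidence lower and upper bound on the probability that the LLM responds without bias on that prompt distribution.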
