Red Teaming

89 papers with code • 0 benchmarks • 0 datasets

This task has no description! Would you like to contribute one?

Libraries

Use these libraries to find Red Teaming models and implementations

Most implemented papers

Explore, Establish, Exploit: Red Teaming Language Models from Scratch

thestephencasper/explore_establish_exploit_llms 15 Jun 2023

Using a pre-existing classifier does not allow for red-teaming to be tailored to the target model.

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

sherdencooper/gptfuzz 19 Sep 2023

Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates.

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

centerforaisafety/harmbench 6 Feb 2024

Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods.

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

anthropics/hh-rlhf 23 Aug 2022

We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs.

Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents

thestephencasper/white_box_rarl 5 Sep 2022

In this work, we study white-box adversarial policies and show that having access to a target agent's internal state can be useful for identifying its vulnerabilities.

Red Teaming Language Model Detectors with Language Models

shizhouxing/Attack-LM-Detectors 31 May 2023

The prevalence and strong capability of large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.

Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

declare-lab/red-instruct 18 Aug 2023

In this work, we propose a new safety evaluation benchmark RED-EVAL that carries out red-teaming.

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

princeton-sysml/jailbreak_llm 10 Oct 2023

Finally, we propose an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under our attack.

Large Language Model Unlearning

kevinyaobytedance/llm_unlearn 14 Oct 2023

To the best of our knowledge, our work is among the first to explore LLM unlearning.

Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?

chiayi-hsu/ring-a-bell 16 Oct 2023

While efforts have been made to mitigate such problems, either by implementing a safety filter at the evaluation stage or by fine-tuning models to eliminate undesirable concepts or styles, the effectiveness of these safety measures in dealing with a wide range of prompts remains largely unexplored.