Red Teaming
89 papers with code • 0 benchmarks • 0 datasets
Benchmarks
These leaderboards are used to track progress in Red Teaming
Libraries
Use these libraries to find Red Teaming models and implementationsMost implemented papers
Explore, Establish, Exploit: Red Teaming Language Models from Scratch
Using a pre-existing classifier does not allow for red-teaming to be tailored to the target model.
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates.
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods.
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs.
Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents
In this work, we study white-box adversarial policies and show that having access to a target agent's internal state can be useful for identifying its vulnerabilities.
Red Teaming Language Model Detectors with Language Models
The prevalence and strong capability of large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
In this work, we propose a new safety evaluation benchmark RED-EVAL that carries out red-teaming.
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Finally, we propose an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under our attack.
Large Language Model Unlearning
To the best of our knowledge, our work is among the first to explore LLM unlearning.
Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?
While efforts have been made to mitigate such problems, either by implementing a safety filter at the evaluation stage or by fine-tuning models to eliminate undesirable concepts or styles, the effectiveness of these safety measures in dealing with a wide range of prompts remains largely unexplored.