Safety Alignment
120 papers with code • 0 benchmarks • 1 dataset
Most implemented papers
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
To this end, we propose Antidote, a post-fine-tuning stage solution that remains agnostic to the training hyper-parameters of the fine-tuning stage.
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis
We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation.
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
In this work, we propose a new safety evaluation benchmark RED-EVAL that carries out red-teaming.
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
A single language model, even when aligned with labelers through reinforcement learning from human feedback (RLHF), may not suit all human preferences.
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
Large Vision-Language Models (LVLMs) signify a groundbreaking paradigm shift within the Artificial Intelligence (AI) community, extending beyond the capabilities of Large Language Models (LLMs) by assimilating additional modalities (e.g., images).
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
The rapid advancement of Large Language Models (LLMs) has brought about remarkable generative capabilities but also raised concerns about their potential misuse.
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
To clear up these concerns, this paper provides a comprehensive overview of three aspects of harmful fine-tuning: attack settings, defense design, and evaluation methodology.
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues
This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries.
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs).
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers.