Safety Alignment
44 papers with code • 0 benchmarks • 0 datasets
Most implemented papers
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
A single language model, even when aligned with labelers through reinforcement learning from human feedback (RLHF), may not suit all human preferences.
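Multi-objective DPO builds on the standard Direct Preference Optimization loss. Below is a minimal, illustrative sketch of that base loss, not the paper's own code; the tensor names and the value of beta are assumptions for illustration. As I understand it, the multi-objective variant combines several such preference objectives rather than fitting a single one.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: prefer the chosen response over the rejected one,
    measured relative to a frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```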
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
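For intuition about what a gradient-based adversarial attack on an LLM looks like, here is a hedged sketch of a generic gradient-guided token-substitution loop in the style of GCG-type attacks; AutoDAN's actual procedure additionally optimizes for interpretable, readable prompts, and the model, loss function, and slicing here are illustrative assumptions rather than the paper's method.

```python
import torch
import torch.nn.functional as F

def gradient_token_attack(model, embed_matrix, input_ids, suffix_slice,
                          loss_fn, steps=100, topk=8):
    """Greedily substitute tokens in an adversarial suffix using gradient signals."""
    ids = input_ids.clone()
    for _ in range(steps):
        one_hot = F.one_hot(ids, embed_matrix.shape[0]).float()
        one_hot.requires_grad_(True)
        embeds = one_hot @ embed_matrix                  # differentiable embedding lookup
        loss = loss_fn(model(inputs_embeds=embeds.unsqueeze(0)))
        loss.backward()
        grad = one_hot.grad[suffix_slice]                # gradients for the editable suffix positions
        candidates = (-grad).topk(topk, dim=-1).indices  # substitutions predicted to lower the loss
        pos = torch.randint(candidates.shape[0], (1,)).item()
        trial = ids.clone()
        trial[suffix_slice][pos] = candidates[pos, 0]    # try the top candidate at a random position
        with torch.no_grad():
            trial_loss = loss_fn(model(inputs_embeds=embed_matrix[trial].unsqueeze(0)))
        if trial_loss < loss:                            # keep the edit only if it helps
            ids = trial
    return ids
```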
FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
Ensuring the safety of artificial intelligence-generated content (AIGC) is a longstanding topic in the artificial intelligence (AI) community, and the safety concerns associated with Large Language Models (LLMs) have been widely investigated.
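FigStep's core idea is typographic: the instruction is rendered as an image of plain text and paired with a benign textual prompt to the vision-language model. A minimal sketch of such rendering with Pillow follows; the file name, canvas size, and layout are illustrative assumptions.

```python
from PIL import Image, ImageDraw

def render_typographic_prompt(text, path="prompt.png", size=(760, 760)):
    """Render an instruction as an image of plain text (a typographic visual prompt)."""
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    draw.multiline_text((20, 20), text, fill="black")  # Pillow's default bitmap font
    img.save(path)
    return path
```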
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs).
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
We propose a novel framework, CipherChat, to systematically examine the generalizability of safety alignment to non-natural languages, namely ciphers.
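For intuition, a cipher such as a Caesar shift re-encodes a chat turn so that it no longer reads as natural language while remaining mechanically decodable. A minimal sketch is below; the shift value is an illustrative choice, not necessarily the paper's configuration.

```python
def caesar_encrypt(text, shift=3):
    """Shift alphabetic characters by `shift` positions; leave everything else unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# caesar_encrypt("attack at dawn") -> "dwwdfn dw gdzq"
```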
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
In this work, we propose a new safety evaluation benchmark RED-EVAL that carries out red-teaming.
All Languages Matter: On the Multilingual Safety of Large Language Models
In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice.
Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench
Large Language Models (LLMs) have recently showcased their remarkable capacities, not only in natural language processing tasks but also across diverse domains such as clinical medicine, legal consultation, and education.
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Optimizing large language models (LLMs) for downstream use cases often involves the customization of pre-trained LLMs through further fine-tuning.
SuperHF: Supervised Iterative Learning from Human Feedback
Here, we focus on two prevalent methods used to align large language models: Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
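As a rough, pseudocode-level sketch of the iterative sample, reward-filter, and supervised fine-tune loop that SuperHF-style methods use (to my understanding), the snippet below shows the general shape; every handle here (policy.generate, reward_model.score, sft_step) is an illustrative assumption rather than the paper's API.

```python
def superhf_style_loop(policy, reward_model, prompts, sft_step,
                       n_samples=4, iterations=3):
    """Sample several completions per prompt, keep the highest-reward one,
    and run supervised fine-tuning on the filtered set; repeat."""
    for _ in range(iterations):
        filtered = []
        for prompt in prompts:
            completions = [policy.generate(prompt) for _ in range(n_samples)]
            best = max(completions, key=lambda c: reward_model.score(prompt, c))
            filtered.append((prompt, best))   # keep only the top-scoring completion
        sft_step(policy, filtered)            # ordinary supervised fine-tuning on the kept pairs
    return policy
```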