Safety Alignment

120 papers with code • 0 benchmarks • 1 dataset

Safety alignment covers methods for training and evaluating large language models (and vision-language models) so that they refuse or safely handle harmful requests, together with attacks and benchmarks that probe how robust that alignment is.

Most implemented papers

Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

git-disl/vaccine 18 Aug 2024

To this end, we propose Antidote, a post-fine-tuning stage solution, which remains agnostic to the training hyper-parameters in the fine-tuning stage.

The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis

bmpixel/safety-residual-space 13 Feb 2025

We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation.
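The excerpt describes measuring how secondary directions in activation space promote or suppress a dominant refusal direction. Below is a minimal, illustrative sketch of that kind of measurement using signed projections; the toy data, the difference-of-means estimate of the dominant direction, and the PCA-style secondary directions are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

# Toy hidden states for prompts the model refuses vs. complies with.
rng = np.random.default_rng(0)
H_refuse = rng.normal(size=(256, 4096))
H_comply = rng.normal(size=(256, 4096))

# Dominant "refusal" direction via difference of means (an assumed estimator).
diff = H_refuse.mean(axis=0) - H_comply.mean(axis=0)
dominant = diff / np.linalg.norm(diff)

# Secondary directions: top right-singular vectors of the centered refusal states.
X = H_refuse - H_refuse.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
secondary = Vt[:5]

for i, v in enumerate(secondary):
    score = float(v @ dominant)  # signed alignment with the dominant direction
    label = "promotes" if score > 0 else "suppresses"
    print(f"direction {i}: {label} the dominant refusal direction ({score:+.3f})")
```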

Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

declare-lab/red-instruct 18 Aug 2023

In this work, we propose RED-EVAL, a new safety-evaluation benchmark that carries out red-teaming.

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

ZHZisZZ/modpo 5 Oct 2023

A single language model, even when aligned with labelers through reinforcement learning from human feedback (RLHF), may not suit all human preferences.
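MODPO builds on Direct Preference Optimization (DPO). As context, here is a minimal sketch of the standard DPO loss plus a simple weighted multi-objective variant; the second function is an illustrative assumption about how multiple preference objectives could be combined, not the paper's exact MODPO objective.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

def weighted_multi_objective_dpo(margins, weights, beta=0.1):
    """Illustrative variant (assumption): combine per-objective preference
    margins with user-chosen weights before the sigmoid."""
    combined = (margins * weights).sum(dim=-1)  # margins: (batch, num_objectives)
    return -F.logsigmoid(beta * combined).mean()

# Toy usage with random summed log-probabilities per response.
lp = torch.randn(8)
loss = dpo_loss(lp, lp - 1.0, lp - 0.2, lp - 0.7)
```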

FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts

thuccslab/figstep 9 Nov 2023

Large Vision-Language Models (LVLMs) signify a groundbreaking paradigm shift within the Artificial Intelligence (AI) community, extending beyond the capabilities of Large Language Models (LLMs) by assimilating additional modalities (e.g., images).
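The "typographic visual prompts" in the title refer to instructions rendered as images so that they enter the model through the vision channel. A minimal sketch of that rendering step with Pillow is below; the canvas size, font, wrapping width, and file name are illustrative choices, not the paper's exact setup.

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_as_image(text, size=(512, 512), path="typographic_prompt.png"):
    """Render a text prompt onto a white canvas as a typographic image."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    # Naive word wrapping so long prompts stay on the canvas.
    words, lines, line = text.split(), [], ""
    for w in words:
        candidate = (line + " " + w).strip()
        if len(candidate) > 40:
            lines.append(line)
            line = w
        else:
            line = candidate
    lines.append(line)
    draw.multiline_text((20, 20), "\n".join(lines), fill="black", font=font)
    img.save(path)
    return path

render_text_as_image("Describe the steps shown in this list.")
```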

CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion

renqibing/CodeAttack 12 Mar 2024

The rapid advancement of Large Language Models (LLMs) has brought about remarkable generative capabilities but also raised concerns about their potential misuse.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

git-disl/awesome_llm-harmful-fine-tuning-papers 26 Sep 2024

To clear up these concerns, this paper provides a comprehensive overview of three aspects of harmful fine-tuning: the attack setting, defense design, and evaluation methodology.

Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

renqibing/actorattack 14 Oct 2024

This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries.

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

pku-alignment/safe-sora NeurIPS 2023

In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs).
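For a quick look at the data, a minimal sketch of loading it with the Hugging Face `datasets` library is below; the hub ID is an assumption based on the authors' organization name, and split and column names should be checked against the actual release.

```python
from datasets import load_dataset

# Hub ID is an assumption; verify against the official BeaverTails release.
ds = load_dataset("PKU-Alignment/BeaverTails")

print(ds)                        # available splits and their sizes
first_split = next(iter(ds))
print(ds[first_split].features)  # column names and types
print(ds[first_split][0])        # one annotated example
```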

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

robustnlp/cipherchat 12 Aug 2023

We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers.
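A cipher transform of this kind is easy to sketch. Below is a minimal Caesar-shift example of encoding a query and decoding a reply; the specific ciphers and prompting setup used in CipherChat may differ.

```python
def caesar(text: str, shift: int = 3) -> str:
    """Shift alphabetic characters by `shift` positions, leaving others unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

query = "Please summarize the safety policy."
encoded = caesar(query, 3)
print(encoded)              # cipher-encoded query sent to the model
print(caesar(encoded, -3))  # the inverse shift recovers the plaintext
```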