Safety Alignment

44 papers with code • 0 benchmarks • 0 datasets

This task has no description! Would you like to contribute one?

Most implemented papers

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

ZHZisZZ/modpo 5 Oct 2023

A single language model, even when aligned with labelers through reinforcement learning from human feedback (RLHF), may not suit all human preferences.

AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models

rotaryhammer/code-autodan 23 Oct 2023

Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.

FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts

thuccslab/figstep 9 Nov 2023

Ensuring the safety of artificial intelligence-generated content (AIGC) is a longstanding topic in the artificial intelligence (AI) community, and the safety concerns associated with Large Language Models (LLMs) have been widely investigated.

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

pku-alignment/safe-sora NeurIPS 2023

In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs).

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

robustnlp/cipherchat 12 Aug 2023

We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers.

Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

declare-lab/red-instruct 18 Aug 2023

In this work, we propose a new safety evaluation benchmark RED-EVAL that carries out red-teaming.

All Languages Matter: On the Multilingual Safety of Large Language Models

jarviswang94/multilingual_safety_benchmark 2 Oct 2023

In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice.

Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench

cuhk-arise/psychobench 2 Oct 2023

Large Language Models (LLMs) have recently showcased their remarkable capacities, not only in natural language processing tasks but also across diverse domains such as clinical medicine, legal consultation, and education.

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

llm-tuning-safety/llms-finetuning-safety 5 Oct 2023

Optimizing large language models (LLMs) for downstream use cases often involves the customization of pre-trained LLMs through further fine-tuning.

SuperHF: Supervised Iterative Learning from Human Feedback

openfeedback/superhf 25 Oct 2023

Here, we focus on two prevalent methods used to align these models, Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).