Search Results for author: Giles Edkins

Found 1 paper, 0 papers with code

Mitigating Unsafe Feedback with Learning Constraints

no code implementations · 19 Sep 2024
Domenic Rosati, Giles Edkins, Harsh Raj, David Atanasov, Subhabrata Majumdar, Janarthanan Rajendran, Frank Rudzicz, Hassan Sajjad

While there has been progress towards aligning Large Language Models (LLMs) with human values and ensuring safe behaviour at inference time, these safety guards can easily be removed when a model is fine-tuned on unsafe and harmful datasets. While this fine-tuning setting has been studied extensively, another popular training paradigm, learning from unsafe feedback with reinforcement learning, has so far remained unexplored.

Safety Alignment · Text Generation
