Search Results

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

4 code implementations · 12 Apr 2022

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants.

Code Generation · Out of Distribution (OOD) Detection +2
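
A minimal sketch of the preference-modeling step this entry describes: a reward model is trained so that the human-preferred ("chosen") response scores higher than the rejected one under a Bradley-Terry style pairwise loss. The function and argument names below are illustrative placeholders, not taken from the paper's released code.

    import torch
    import torch.nn.functional as F

    def preference_loss(reward_model, chosen_ids, rejected_ids):
        """Bradley-Terry style pairwise loss: -log sigmoid(r_chosen - r_rejected).

        reward_model is assumed to map token ids to one scalar score per
        sequence; the names here are illustrative placeholders."""
        r_chosen = reward_model(chosen_ids)      # shape: (batch,)
        r_rejected = reward_model(rejected_ids)  # shape: (batch,)
        return -F.logsigmoid(r_chosen - r_rejected).mean()

The resulting reward model then supplies the scalar signal that the RL stage (e.g. PPO) optimizes against.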

Safe RLHF: Safe Reinforcement Learning from Human Feedback

1 code implementation · 19 Oct 2023

However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training.

reinforcement-learning · Reinforcement Learning +1
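
Safe RLHF handles this tension by decoupling helpfulness and harmlessness into a separate reward model and cost model, traded off with a Lagrange multiplier during policy optimization. A hedged sketch of that scalarization, with illustrative names and not the authors' released implementation:

    import torch

    def lagrangian_objective(reward, cost, log_multiplier, cost_budget=0.0):
        """Scalarize helpfulness (reward) against harmlessness (cost) in a
        constrained-RL / Lagrangian setup: maximize reward while keeping
        expected cost under a budget. Illustrative sketch only."""
        lam = torch.nn.functional.softplus(log_multiplier)  # keep multiplier >= 0
        return reward - lam * (cost - cost_budget)

    # The multiplier itself is updated by gradient ascent on the constraint
    # violation, so it grows whenever expected cost exceeds the budget:
    #   log_multiplier += lr * (mean_cost - cost_budget)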

Reinforcement Learning from Human Feedback

1 code implementation · 16 Apr 2025

Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems.

Math · Philosophy +2

Online Iterative Reinforcement Learning from Human Feedback with General Preference Model

1 code implementation · 11 Feb 2024

We investigate Reinforcement Learning from Human Feedback (RLHF) in the context of a general preference oracle.
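
Under a general preference oracle, the learner only assumes access to pairwise win probabilities P(a ≻ b | x) rather than a scalar reward model. A minimal interface sketch; the class and method names are assumptions for illustration, not the paper's API:

    import random
    from typing import Protocol

    class PreferenceOracle(Protocol):
        """Black-box access to pairwise preferences instead of a scalar reward."""
        def prob_prefer(self, prompt: str, response_a: str, response_b: str) -> float:
            """Return P(response_a is preferred to response_b | prompt)."""
            ...

    def duel(oracle: PreferenceOracle, prompt: str, a: str, b: str) -> str:
        """Sample one comparison from the oracle and keep the winner; an online
        iterative loop repeats this against the policy's current responses."""
        return a if random.random() < oracle.prob_prefer(prompt, a, b) else b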

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

2 code implementations · 23 Aug 2022

We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs.

Language Modelling · Red Teaming

Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback

2 code implementations · 29 Jul 2023

Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research.

OPT: Open Pre-trained Transformer Language Models

11 code implementations · 2 May 2022

Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning.

Decoder · Hate Speech Detection +2

MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

1 code implementation · 3 Oct 2024

However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes.

Code Generation · Dialogue Generation +5
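
The macro-action idea is to group consecutive tokens into coarser decision units so that credit is assigned over far fewer steps. A hedged sketch of fixed-length token grouping; the helper below is an illustrative assumption, and the paper also considers other termination rules:

    def group_into_macro_actions(token_rewards, span=5):
        """Sum per-token rewards into fixed-length macro actions, so advantages
        are computed over roughly len(token_rewards)/span steps instead of one
        step per token. Purely illustrative of the grouping idea."""
        macro_rewards = []
        for start in range(0, len(token_rewards), span):
            macro_rewards.append(sum(token_rewards[start:start + span]))
        return macro_rewards

    # Example: 12 token-level rewards collapse into 3 macro-level rewards.
    print(group_into_macro_actions([1.0] * 12, span=5))  # [5.0, 5.0, 2.0]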

TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback

1 code implementation · 23 Jul 2024

These human preference data, however, are labeled at the sequence level, creating a mismatch between sequence-level preference labels and tokens, which are autoregressively generated from the language model.

Language Modeling · Language Modelling
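
The core move here is to turn a sequence-level preference into a dense, per-token signal by scoring each generated token and using a continuous value rather than a hard positive/negative label. A hedged sketch of mapping a per-token discriminator score to a continuous reward; the function name and scaling are placeholders, not the released TLCR API:

    import torch

    def token_level_rewards(token_logits):
        """Map a per-token discriminator logit (higher = more 'preferred') to a
        continuous reward in (-1, 1) for each generated token. Placeholder
        sketch; the actual discriminator and scaling live in the paper's code."""
        # 2 * sigmoid(logit) - 1 maps (-inf, inf) -> (-1, 1), keeping the sign
        # of the preference while staying continuous (no hard +1/-1 labels).
        return 2.0 * torch.sigmoid(token_logits) - 1.0

    # Example: three tokens with discriminator logits of varying confidence.
    print(token_level_rewards(torch.tensor([2.0, 0.0, -1.5])))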

Multi-turn Reinforcement Learning from Preference Human Feedback

1 code implementation · 23 May 2024

Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks.

reinforcement-learning · Reinforcement Learning +1