no code implementations • 18 Apr 2024 • Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn
Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the model's entire response is treated as a single arm.
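To make the contrast concrete, a hedged sketch of the two views in standard DPO notation (β, π_ref, and the (x, y_w, y_l) preference triple follow the DPO literature, not this entry's exact notation):

```latex
% Bandit-level DPO loss: the full response y is a single arm.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
      \log \sigma\!\Big(
        \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \Big)\right]

% Token-level view: because \pi_\theta(y \mid x) factorizes autoregressively,
% the same sequence-level log-ratio decomposes over the token MDP,
% with s_t = (x, a_{<t}) the prompt plus tokens generated so far.
\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  = \sum_{t=0}^{|y|-1} \beta \log
      \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
```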
1 code implementation • 20 Oct 2023 • Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh
Thus, learning a reward function from feedback is not only based on a flawed assumption about human preferences, but also leads to unwieldy optimization challenges stemming from policy gradients or bootstrapping in the RL phase.
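Since this entry argues against the reward-learning-then-RL pipeline, here is a minimal sketch of the alternative it points toward: scoring preference segments directly by policy log-probabilities and training with a contrastive, Bradley-Terry-style objective. The function name, the temperature alpha, and the discounting are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_preference_loss(logp_chosen, logp_rejected, alpha=0.1, gamma=0.99):
    """Score each segment by its discounted sum of per-step policy log-probs
    and apply a contrastive loss over the preference pair.
    logp_* hold log pi(a_t | s_t), shape (batch, horizon)."""
    steps = torch.arange(logp_chosen.shape[1], device=logp_chosen.device)
    discounts = gamma ** steps
    score_chosen = alpha * (discounts * logp_chosen).sum(dim=1)
    score_rejected = alpha * (discounts * logp_rejected).sum(dim=1)
    # Maximize the probability that the preferred segment scores higher;
    # no reward model and no policy-gradient/bootstrapped RL step is needed.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```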
1 code implementation • 21 Jun 2023 • Joey Hejna, Pieter Abbeel, Lerrel Pinto
Complex, long-horizon planning and its combinatorial nature pose steep challenges for learning-based agents.
1 code implementation • 26 Apr 2023 • Joey Hejna, Jensen Gao, Dorsa Sadigh
To bridge the gap between IL and RL, we introduce Distance Weighted Supervised Learning or DWSL, a supervised method for learning goal-conditioned policies from offline data.
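A minimal sketch of the general recipe the abstract describes: estimate state-goal distances with supervised learning, then weight behavior cloning by the estimated one-step improvement toward the goal. The network interfaces, the temperature beta, the clipping, and the discrete-action assumption are illustrative choices, not the exact DWSL objective.

```python
import torch

def dwsl_style_loss(dist_net, policy, s, a, s_next, g, beta=1.0, clip=20.0):
    """Distance-weighted supervised learning sketch (discrete actions assumed).
    dist_net(s, g) -> estimated steps-to-goal; policy(s, g) -> action logits."""
    with torch.no_grad():
        d_now = dist_net(s, g).squeeze(-1)
        d_next = dist_net(s_next, g).squeeze(-1)
        # Advantage of the dataset action: how much closer it moves us to g.
        adv = d_now - (1.0 + d_next)
        weights = torch.exp(adv / beta).clamp(max=clip)
    # Weighted behavior cloning on offline (s, a, g) tuples: pure supervised
    # learning, no bootstrapping or policy gradients.
    logp = torch.distributions.Categorical(logits=policy(s, g)).log_prob(a)
    return -(weights * logp).mean()
```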
3 code implementations • 5 Jan 2023 • Divyansh Garg, Joey Hejna, Matthieu Geist, Stefano Ermon
Using EVT, we derive our Extreme Q-Learning framework and, from it, online and (for the first time) offline MaxEnt Q-learning algorithms that do not explicitly require access to a policy or its entropy.
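The framework's Gumbel regression ("linex") loss fits a soft value function V(s) ≈ β log E[exp(Q(s, a)/β)] without sampling actions from a policy; the sketch below follows the published form L = E[e^z − z − 1] with z = (Q − V)/β, though the clamping and interfaces are implementation assumptions.

```python
import torch

def gumbel_regression_loss(q_values, v_values, beta=1.0, clip=7.0):
    """Gumbel regression loss for fitting a MaxEnt soft value function
    directly from dataset (s, a) pairs, with no policy or entropy term."""
    z = (q_values - v_values) / beta
    z = torch.clamp(z, max=clip)  # cap z to avoid exp() overflow (an assumption)
    return (torch.exp(z) - z - 1.0).mean()
```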
no code implementations • 6 Dec 2022 • Joey Hejna, Dorsa Sadigh
In contrast to most work, which focuses on query selection to minimize the amount of data required for learning reward functions, we take the opposite approach: expanding the pool of available data by viewing human-in-the-loop RL through the more flexible lens of multi-task learning.
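A minimal sketch of that multi-task framing: pretrain a reward model on preference data pooled from prior tasks, then adapt it with only a few queries on the target task. The training loops, helper names, and Bradley-Terry loss are hypothetical scaffolding for illustration, not the paper's algorithm.

```python
import torch.nn.functional as F

def preference_loss(reward_net, seg_a, seg_b, label):
    """Bradley-Terry loss: label = 1 if segment A is preferred.
    reward_net maps a segment to per-step rewards of shape (batch, horizon)."""
    r_a = reward_net(seg_a).sum(dim=1)
    r_b = reward_net(seg_b).sum(dim=1)
    return F.binary_cross_entropy_with_logits(r_a - r_b, label.float())

def multitask_pretrain_then_adapt(reward_net, opt, prior_tasks, target_queries, adapt_steps=50):
    # Expand the data pool: train on preferences pooled from many prior tasks...
    for seg_a, seg_b, label in prior_tasks:
        opt.zero_grad()
        preference_loss(reward_net, seg_a, seg_b, label).backward()
        opt.step()
    # ...then specialize with only a handful of human queries on the new task.
    for _ in range(adapt_steps):
        for seg_a, seg_b, label in target_queries:
            opt.zero_grad()
            preference_loss(reward_net, seg_a, seg_b, label).backward()
            opt.step()
```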