1 code implementation • 11 Jan 2025 • Stephane Hatgis-Kessell, W. Bradley Knox, Serena Booth, Scott Niekum, Peter Stone
A preference model that poorly describes how humans generate preferences risks learning a poor approximation of the human's reward function.
no code implementations • 7 Dec 2024 • Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum
In this work, we show that we can achieve a zero-shot language-to-behavior policy by first grounding the imagined sequences in real observations of an unsupervised RL agent and using a closed-form solution to imitation learning that allows the RL agent to mimic the grounded observations.
1 code implementation • 29 Oct 2024 • Stephen Chung, Scott Niekum, David Krueger
Our results show that the plans of explicitly planning agents are significantly more informative for prediction than the neuron activations of the other agent types.
no code implementations • 24 Oct 2024 • Zizhao Wang, Jiaheng Hu, Caleb Chuck, Stephen Chen, Roberto Martín-Martín, Amy Zhang, Scott Niekum, Peter Stone
However, in complex environments with many state factors (e.g., household environments with many objects), learning skills that cover all possible states is impossible, and naively encouraging state diversity often leads to simple skills that are not ideal for solving downstream tasks.
no code implementations • 21 Jun 2024 • Ryan Boldi, Li Ding, Lee Spector, Scott Niekum
However, preferences sourced from diverse populations can result in point estimates of human values that may be sub-optimal or unfair to specific groups.
no code implementations • 13 Jun 2024 • Harshit Sikchi, Caleb Chuck, Amy Zhang, Scott Niekum
DILO reduces the learning from observations problem to that of simply learning an actor and a critic, bearing similar complexity to vanilla offline RL.
no code implementations • 5 Jun 2024 • Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, Scott Niekum
Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs); however, it is often a complex and brittle process.
no code implementations • 6 May 2024 • Caleb Chuck, Carl Qi, Michael J. Munje, Shuozhe Li, Max Rudolph, Chang Shi, Siddhant Agarwal, Harshit Sikchi, Abhinav Peri, Sarthak Dayal, Evan Kuo, Kavan Mehta, Anthony Wang, Peter Stone, Amy Zhang, Scott Niekum
Reinforcement Learning is a promising tool for learning complex policies even in fast-moving and object-interactive domains where human teleoperation or hard-coded policies might fail.
1 code implementation • 2 May 2024 • Prasann Singhal, Nathan Lambert, Scott Niekum, Tanya Goyal, Greg Durrett
Varied approaches for aligning language models have been proposed, including supervised fine-tuning, RLHF, and direct optimization methods such as DPO.
no code implementations • 16 Apr 2024 • Caleb Chuck, Sankaran Vaidyanathan, Stephen Giguere, Amy Zhang, David Jensen, Scott Niekum
This paper introduces functional actual cause (FAC), a framework that uses context-specific independencies in the environment to restrict the set of actual causes.
no code implementations • 25 Mar 2024 • Max Rudolph, Caleb Chuck, Kevin Black, Misha Lvovsky, Scott Niekum, Amy Zhang
Robust reinforcement learning agents using high-dimensional observations must be able to identify relevant state features amidst many exogenous distractors.
no code implementations • 3 Nov 2023 • Harshit Sikchi, Rohan Chitnis, Ahmed Touati, Alborz Geramifard, Amy Zhang, Scott Niekum
Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve multiple goals in an environment purely from offline datasets using sparse reward functions.
1 code implementation • 20 Oct 2023 • Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh
Thus, learning a reward function from feedback is not only based on a flawed assumption of human preference, but also leads to unwieldy optimization challenges that stem from policy gradients or bootstrapping in the RL phase.
1 code implementation • 3 Oct 2023 • W. Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson, Serena Booth, Anca Dragan, Peter Stone, Scott Niekum
Most recent work assumes that human preferences are generated based only upon the reward accrued within those segments, or their partial return.
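As an illustration of that partial-return assumption, here is a minimal Bradley-Terry-style sketch of the preference probability over two segments (toy reward values; an assumption-laden illustration, not the authors' code):

```python
import numpy as np

def partial_return_preference_prob(rewards_a, rewards_b):
    """Probability that segment A is preferred over segment B under the
    partial-return preference model: preferences depend only on the
    summed reward accrued within each segment (toy illustration)."""
    ret_a, ret_b = np.sum(rewards_a), np.sum(rewards_b)
    return 1.0 / (1.0 + np.exp(-(ret_a - ret_b)))

# Example: segment A accrues more reward, so it is preferred more often.
print(partial_return_preference_prob([1.0, 0.5, 0.0], [0.0, 0.2, 0.1]))
```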
no code implementations • 6 Jul 2023 • Andrew Levy, Sreehari Rammohan, Alessandro Allievi, Scott Niekum, George Konidaris
Our framework makes two specific contributions.
no code implementations • 15 Jun 2023 • Caleb Chuck, Kevin Black, Aditya Arjun, Yuke Zhu, Scott Niekum
Reinforcement Learning (RL) has demonstrated promising results in learning policies for complex tasks, but it often suffers from low sample efficiency and limited transferability.
1 code implementation • 16 Feb 2023 • Harshit Sikchi, Qinqing Zheng, Amy Zhang, Scott Niekum
For offline RL, our analysis frames a recent offline RL method, XQL, in the dual framework, and we further propose a new method, f-DVL, that provides alternative choices to the Gumbel regression loss and fixes the known training instability issue of XQL.
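For reference, a sketch of a Gumbel-regression-style loss of the kind used by XQL, whose exponential term is the usual source of the instability mentioned above (the exact form is an assumption reproduced from memory, not the papers' implementation):

```python
import torch

def gumbel_regression_loss(value, target, beta=1.0):
    """Extreme-value (Gumbel) regression loss of the form
    E[exp(z) - z - 1] with z = (target - value) / beta.
    Assumed form for illustration; the exp term can blow up gradients."""
    z = (target - value) / beta
    z = torch.clamp(z, max=5.0)  # common stabilization trick
    return torch.mean(torch.exp(z) - z - 1.0)

value = torch.zeros(4, requires_grad=True)
target = torch.tensor([0.5, -0.2, 1.0, 0.1])
loss = gumbel_regression_loss(value, target)
loss.backward()
```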
no code implementations • 24 Jan 2023 • Prasoon Goyal, Raymond J. Mooney, Scott Niekum
We introduce a novel setting, wherein an agent needs to learn a task from a demonstration of a related task with the difference between the tasks communicated in natural language.
no code implementations • 5 Jun 2022 • W. Bradley Knox, Stephane Hatgis-Kessell, Serena Booth, Scott Niekum, Peter Stone, Alessandro Allievi
We empirically show that our proposed regret preference model outperforms the partial return preference model with finite training data in otherwise the same setting.
no code implementations • 1 Jun 2022 • Wonjoon Goo, Scott Niekum
In this work, we argue that it is not only viable but beneficial to explicitly model the behavior policy for offline RL because the constraint can be realized in a stable way with the trained model.
no code implementations • 23 Apr 2022 • Yuchen Cui, Scott Niekum, Abhinav Gupta, Vikash Kumar, Aravind Rajeswaran
Task specification is at the core of programming autonomous robots.
no code implementations • 7 Feb 2022 • Harshit Sikchi, Akanksha Saran, Wonjoon Goo, Scott Niekum
We propose a new framework for imitation learning -- treating imitation as a two-player ranking-based game between a policy and a reward.
1 code implementation • NeurIPS 2021 • Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, Scott Niekum
In this paper, we present a new perspective on this bias-variance trade-off and show the existence of a spectrum of estimators whose endpoints are SIS and IS.
no code implementations • 5 Oct 2021 • Wonjoon Goo, Scott Niekum
The goal of offline reinforcement learning (RL) is to find an optimal policy given prerecorded trajectories.
no code implementations • ICLR 2022 • Stephen Giguere, Blossom Metevier, Yuriy Brun, Philip S. Thomas, Scott Niekum, Bruno Castro da Silva
Recent studies have demonstrated that using machine learning for social applications can lead to injustice in the form of racist, sexist, and otherwise unfair and discriminatory outcomes.
1 code implementation • 12 Aug 2021 • Ajinkya Jain, Stephen Giguere, Rudolf Lioutikov, Scott Niekum
Our core contributions include a novel representation for distributions over rigid body transformations and articulation model parameters based on screw theory, von Mises-Fisher distributions, and Stiefel manifolds.
no code implementations • 30 Jun 2021 • Farzan Memarian, Abolfazl Hashemi, Scott Niekum, Ufuk Topcu
We explore methodologies to improve the robustness of generative adversarial imitation learning (GAIL) algorithms to observation noise.
no code implementations • 5 Jun 2021 • Prasoon Goyal, Raymond J. Mooney, Scott Niekum
Imitation learning and instruction-following are two common approaches to communicate a user's intent to a learning agent.
1 code implementation • NeurIPS 2021 • Ishan Durugkar, Mauricio Tec, Scott Niekum, Peter Stone
In this paper, we investigate whether one such objective, the Wasserstein-1 distance between a policy's state visitation distribution and a target distribution, can be utilized effectively for reinforcement learning (RL) tasks.
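To make the objective concrete, a toy sketch of the Wasserstein-1 distance between two one-dimensional empirical distributions, where a closed form exists (the paper's state visitation distributions are higher-dimensional and generally require a dual or potential-based formulation):

```python
import numpy as np

def wasserstein1_1d(samples_p, samples_q):
    """W1 distance between two 1-D empirical distributions with equal
    sample counts: mean absolute difference of the sorted samples.
    Toy special case for illustration only."""
    p = np.sort(np.asarray(samples_p, dtype=float))
    q = np.sort(np.asarray(samples_q, dtype=float))
    assert p.shape == q.shape
    return np.mean(np.abs(p - q))

# "States" visited by a policy vs. a target (goal) distribution.
visited = np.random.normal(0.0, 1.0, size=1000)
target = np.random.normal(2.0, 1.0, size=1000)
print(wasserstein1_1d(visited, target))  # roughly 2.0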
1 code implementation • NeurIPS 2021 • Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, Philip S. Thomas
When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy.
1 code implementation • 8 Mar 2021 • Farzan Memarian, Wonjoon Goo, Rudolf Lioutikov, Scott Niekum, Ufuk Topcu
We introduce Self-supervised Online Reward Shaping (SORS), which aims to improve the sample efficiency of any RL algorithm in sparse-reward environments by automatically densifying rewards.
1 code implementation • 2 Dec 2020 • Daniel S. Brown, Jordan Schneider, Anca D. Dragan, Scott Niekum
In this paper we formalize and theoretically analyze the problem of efficient value alignment verification: how to efficiently test whether the behavior of another agent is aligned with a human's values.
1 code implementation • 28 Sep 2020 • Yuchen Cui, Qiping Zhang, Alessandro Allievi, Peter Stone, Scott Niekum, W. Bradley Knox
We train a deep neural network on this data and demonstrate its ability to (1) infer relative reward ranking of events in the training task from prerecorded human facial reactions; (2) improve the policy of an agent in the training task using live human facial reactions; and (3) transfer to a novel domain in which it evaluates robot manipulation trajectories.
1 code implementation • 24 Aug 2020 • Ajinkya Jain, Rudolf Lioutikov, Caleb Chuck, Scott Niekum
Robots in human environments will need to interact with a wide variety of articulated objects such as cabinets, drawers, and dishwashers while assisting humans in performing day-to-day tasks.
1 code implementation • ICML Workshop LaReL 2020 • Prasoon Goyal, Scott Niekum, Raymond J. Mooney
Reinforcement learning (RL), particularly in sparse reward settings, often requires prohibitively large numbers of interactions with the environment, thereby limiting its applicability to complex problems.
1 code implementation • NeurIPS 2020 • Daniel S. Brown, Scott Niekum, Marek Petrik
Existing safe imitation learning approaches based on IRL deal with this uncertainty using a maxmin framework that optimizes a policy under the assumption of an adversarial reward function, whereas risk-neutral IRL approaches either optimize a policy for the mean or MAP reward function.
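A small illustration of the risk-averse alternative contrasted with the mean/MAP and maxmin objectives: scoring a policy by the conditional value at risk (CVaR) of its expected return across reward functions sampled from the IRL posterior (toy numbers, not the paper's code):

```python
import numpy as np

def cvar(returns_under_posterior, alpha=0.05):
    """alpha-CVaR: mean of the worst alpha-fraction of a policy's expected
    returns, each computed under one posterior reward sample (toy sketch)."""
    r = np.sort(np.asarray(returns_under_posterior, dtype=float))
    k = max(1, int(np.ceil(alpha * len(r))))
    return r[:k].mean()

posterior_returns = np.random.normal(10.0, 3.0, size=10_000)
print("mean:", posterior_returns.mean(), "5%-CVaR:", cvar(posterior_returns))
```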
no code implementations • 28 Feb 2020 • Akanksha Saran, Ruohan Zhang, Elaine Schaertl Short, Scott Niekum
Based on similarities between the attention of reinforcement learning agents and human gaze, we propose a novel, computationally efficient approach for utilizing gaze data as part of an auxiliary loss function that guides a network to have higher activations in image regions where the human's gaze fixated.
1 code implementation • ICML 2020 • Daniel S. Brown, Russell Coleman, Ravi Srinivasan, Scott Niekum
Bayesian REX can learn to play Atari games from demonstrations without access to the game score, and can generate 100,000 samples from the posterior over reward functions in only 5 minutes on a personal laptop.
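The reported speed stems from running MCMC over the weights of a linear reward defined on fixed, pretrained trajectory features, so each likelihood evaluation is only a few dot products. A minimal Metropolis-Hastings sketch under that reading, with toy features and preferences (not the released implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy precomputed feature counts for 20 demo trajectories (8-D features)
# and toy pairwise preferences (trajectory i preferred over trajectory j).
feats = rng.normal(size=(20, 8))
prefs = [(i, j) for i in range(1, 20) for j in range(i)][:50]

def log_likelihood(w):
    # Bradley-Terry likelihood over linear returns feats @ w.
    returns = feats @ w
    return sum(-np.log1p(np.exp(-(returns[i] - returns[j]))) for i, j in prefs)

# Metropolis-Hastings over unit-norm reward weights (simplified sketch).
w = rng.normal(size=8); w /= np.linalg.norm(w)
samples, ll = [], log_likelihood(w)
for _ in range(5000):
    w_new = w + 0.05 * rng.normal(size=8)
    w_new /= np.linalg.norm(w_new)
    ll_new = log_likelihood(w_new)
    if np.log(rng.random()) < ll_new - ll:
        w, ll = w_new, ll_new
    samples.append(w.copy())

posterior = np.array(samples)  # samples from the reward-weight posterior
```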
1 code implementation • 9 Feb 2020 • Wonjoon Goo, Scott Niekum
A central goal of meta-learning is to find a learning rule that enables fast adaptation across a set of tasks, by learning the appropriate inductive bias for that set.
no code implementations • 10 Dec 2019 • Daniel S. Brown, Scott Niekum
Bayesian inverse reinforcement learning (IRL) methods are ideal for safe imitation learning, as they allow a learning agent to reason about reward uncertainty and the safety of a learned policy.
2 code implementations • 9 Jul 2019 • Daniel S. Brown, Wonjoon Goo, Scott Niekum
The performance of imitation learning is typically upper-bounded by the performance of the demonstrator.
no code implementations • 6 Jul 2019 • Oliver Kroemer, Scott Niekum, George Konidaris
A key challenge in intelligent robotics is creating robots that are capable of directly interacting with the world around them to achieve their goals.
no code implementations • 27 May 2019 • Caleb Chuck, Supawit Chockchowwat, Scott Niekum
Deep reinforcement learning (DRL) is capable of learning high-performing policies on a variety of complex high-dimensional tasks, ranging from video games to robotic manipulation.
no code implementations • 7 May 2019 • Yuchen Cui, David Isele, Scott Niekum, Kikuo Fujimura
Our analysis shows that UAIL outperforms existing data aggregation algorithms on a series of benchmark tasks.
3 code implementations • 12 Apr 2019 • Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, Scott Niekum
A critical flaw of existing inverse reinforcement learning (IRL) methods is their inability to significantly outperform the demonstrator.
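The extrapolation idea rests on a ranking loss over demonstrations: the learned reward should assign a higher cumulative reward to better-ranked trajectories. A condensed PyTorch sketch of that kind of loss (toy network and data, not the released code):

```python
import torch
import torch.nn as nn

# Toy reward network over 4-D observations (illustrative only).
reward_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def ranking_loss(traj_worse, traj_better):
    """Cross-entropy ranking loss: the summed predicted reward of the
    better-ranked trajectory should exceed that of the worse one."""
    r_worse = reward_net(traj_worse).sum()
    r_better = reward_net(traj_better).sum()
    logits = torch.stack([r_worse, r_better])
    return nn.functional.cross_entropy(logits.unsqueeze(0),
                                       torch.tensor([1]))

# Toy pair of trajectories (50 states each, 4-D observations).
worse, better = torch.randn(50, 4), torch.randn(50, 4)
loss = ranking_loss(worse, better)
opt.zero_grad(); loss.backward(); opt.step()
```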
1 code implementation • 5 Mar 2019 • Prasoon Goyal, Scott Niekum, Raymond J. Mooney
A common approach to reduce interaction time with the environment is to use reward shaping, which involves carefully designing reward functions that provide the agent intermediate rewards for progress towards the goal.
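For context, the standard potential-based form of reward shaping preserves optimal policies; a minimal sketch with a hypothetical potential function (the paper itself derives its shaping signal from natural-language instructions):

```python
def shaped_reward(reward, state, next_state, potential, gamma=0.99):
    """Potential-based shaping: r'(s, a, s') = r + gamma * Phi(s') - Phi(s).
    Any potential Phi leaves the optimal policy unchanged (Ng et al., 1999)."""
    return reward + gamma * potential(next_state) - potential(state)

# Hypothetical potential: negative distance to a known goal state.
goal = 10.0
potential = lambda s: -abs(goal - s)

print(shaped_reward(0.0, state=3.0, next_state=4.0, potential=potential))
```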
2 code implementations • 8 Jan 2019 • Daniel S. Brown, Yuchen Cui, Scott Niekum
Active learning from demonstration allows a robot to query a human for specific types of input to achieve efficient learning.
1 code implementation • 29 Jun 2018 • Wonjoon Goo, Scott Niekum
Due to burdensome data requirements, learning from demonstration often falls short of its promise to allow users to quickly and naturally program robots.
1 code implementation • 4 Jun 2018 • Josiah P. Hanna, Scott Niekum, Peter Stone
We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set.
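A sketch of the ordinary importance-sampling estimator this result concerns, where the denominator can use action probabilities from either the true or an estimated behavior policy (toy inputs; not the paper's code):

```python
import numpy as np

def importance_sampling_estimate(trajectories, pi_eval, pi_behavior):
    """Per-trajectory (ordinary) importance-sampling estimate of the
    evaluation policy's expected return. Each trajectory is a list of
    (state, action, reward); pi_* map (state, action) -> probability."""
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for s, a, r in traj:
            weight *= pi_eval(s, a) / pi_behavior(s, a)
            ret += r
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# Toy usage: uniform behavior policy over 2 actions, greedy evaluation policy.
behavior = lambda s, a: 0.5
evaluation = lambda s, a: 1.0 if a == 0 else 0.0
trajs = [[(0, 0, 1.0), (1, 0, 1.0)], [(0, 1, 0.0)]]
print(importance_sampling_estimate(trajs, evaluation, behavior))
```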
1 code implementation • 20 May 2018 • Daniel S. Brown, Scott Niekum
Inverse reinforcement learning (IRL) infers a reward function from demonstrations, allowing for policy improvement and generalization.
1 code implementation • 12 Feb 2018 • Ajinkya Jain, Scott Niekum
This hierarchical planning approach results in a decomposition of the POMDP planning problem into smaller sub-parts that can be solved with significantly lower computational costs.
1 code implementation • 29 Aug 2017 • Mohammed Alshiekh, Roderick Bloem, Ruediger Ehlers, Bettina Könighofer, Scott Niekum, Ufuk Topcu
In the first one, the shield acts each time the learning agent is about to make a decision and provides a list of safe actions.
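A schematic of that first (preemptive) shielding mode: the shield restricts the agent to a certified-safe action set before anything is executed (hypothetical safe-set oracle; not the paper's reactive-synthesis implementation):

```python
import random

def shielded_action(state, agent_policy, safe_actions):
    """Preemptive shielding: restrict the agent's choice to actions the
    shield certifies as safe in the current state, falling back to a
    random safe action if the agent's preferred action is unsafe."""
    allowed = safe_actions(state)
    preferred = agent_policy(state)
    return preferred if preferred in allowed else random.choice(allowed)

# Hypothetical example: action 2 is unsafe near the boundary state 9.
safe = lambda s: [0, 1] if s >= 9 else [0, 1, 2]
policy = lambda s: 2
print(shielded_action(9, policy, safe))  # falls back to a safe action
```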
3 code implementations • 3 Jul 2017 • Daniel S. Brown, Scott Niekum
In the field of reinforcement learning, there has been recent progress towards safety and high-confidence bounds on policy performance.
1 code implementation • ICML 2017 • Josiah P. Hanna, Philip S. Thomas, Peter Stone, Scott Niekum
The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance.
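That standard technique is simply an on-policy Monte Carlo estimate; a minimal sketch for contrast with the off-policy estimator the paper proposes (hypothetical, simplified environment interface):

```python
import numpy as np

def monte_carlo_value(env, policy, n_episodes=100, gamma=1.0):
    """Deploy the policy and average its observed discounted returns.
    Unbiased, but requires running the (possibly risky) policy itself.
    Assumes a hypothetical env with reset() and step(a) -> (s, r, done)."""
    returns = []
    for _ in range(n_episodes):
        state, done, ret, t = env.reset(), False, 0.0, 0
        while not done:
            state, reward, done = env.step(policy(state))
            ret += (gamma ** t) * reward
            t += 1
        returns.append(ret)
    return float(np.mean(returns))
```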
no code implementations • 20 Jun 2016 • Josiah P. Hanna, Peter Stone, Scott Niekum
In this context, we propose two bootstrapping off-policy evaluation methods which use learned MDP transition models in order to estimate lower confidence bounds on policy performance with limited data in both continuous and discrete state spaces.
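The bootstrapping step itself can be illustrated independently of the learned models: resample return estimates with replacement and take a percentile as the lower confidence bound (toy data; in the paper the returns come from learned MDP transition models):

```python
import numpy as np

def bootstrap_lower_bound(returns, confidence=0.95, n_boot=2000, seed=0):
    """Percentile-bootstrap lower confidence bound on the mean return."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    means = [rng.choice(returns, size=len(returns), replace=True).mean()
             for _ in range(n_boot)]
    return float(np.percentile(means, 100 * (1 - confidence)))

# Toy return samples; the bound sits conservatively below the true mean of 5.
returns = np.random.default_rng(1).normal(5.0, 2.0, size=50)
print(bootstrap_lower_bound(returns))
```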
no code implementations • NeurIPS 2015 • Philip S. Thomas, Scott Niekum, Georgios Theocharous, George Konidaris
The benefit of the Ω-return is that it accounts for the correlation between returns of different lengths.
no code implementations • NeurIPS 2011 • Scott Niekum, Andrew G. Barto
Skill discovery algorithms in reinforcement learning typically identify single states or regions in state space that correspond to task-specific subgoals.
no code implementations • NeurIPS 2011 • George Konidaris, Scott Niekum, Philip S. Thomas
We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TD(γ) family of complex-backup temporal difference learning algorithms.
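For reference, the λ-return being modeled is a geometrically weighted mixture of n-step returns; a small sketch computing it for the first state of a recorded episode (toy rewards and value estimates assumed):

```python
import numpy as np

def lambda_return(rewards, values, gamma=0.99, lam=0.9):
    """Lambda-return for the first state of an episode: a (1 - lam)-weighted
    geometric mixture of the n-step returns, with the final n-step return
    (the full Monte Carlo return) absorbing the remaining weight."""
    T = len(rewards)
    n_step = []
    for n in range(1, T + 1):
        g = sum(gamma ** k * rewards[k] for k in range(n))
        if n < T:
            g += gamma ** n * values[n]  # bootstrap from V(s_n)
        n_step.append(g)
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, T)] + [lam ** (T - 1)]
    return float(np.dot(weights, n_step))

# Toy 3-step episode with value estimates for s_0, s_1, s_2.
print(lambda_return([1.0, 0.0, 1.0], values=[0.5, 0.4, 0.2]))
```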