Identifying the Reward Function by Anchor Actions

We propose a reward-function estimation framework for inverse reinforcement learning with deep energy-based policies. Our method sequentially estimates the policy, the $Q$-function, and the reward; we therefore refer to it as the PQR method. Unlike many existing approaches, PQR does not assume that the reward depends only on the state: the reward may also depend on the chosen action, and the state transitions may be stochastic. To achieve identifiability, we assume the existence of one anchor action whose reward is known, typically the action of doing nothing, which yields no reward. We present estimators and algorithms for the PQR method. When the environment transition is known, we prove that the PQR reward estimator uniquely recovers the true reward. When the transitions are unknown, we provide a convergence analysis for the PQR method. Finally, we apply PQR to both synthetic and real-world datasets, demonstrating superior reward-estimation performance compared to competing methods.

ICML 2020
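To make the anchor-action idea concrete, here is a minimal tabular sketch, not the authors' deep estimator. It assumes a finite MDP with a known transition tensor, an energy-based expert policy with $\pi(a \mid s) \propto \exp Q(s,a)$, a discount factor `gamma`, and an anchor action `a_anchor` whose reward is zero; all variable names are hypothetical. The policy identifies $Q$ only up to a state-dependent shift, the anchor action pins down that shift through the soft Bellman equation, and the reward is then recovered for every action.

```python
# Minimal tabular sketch of the PQR identification idea (illustrative only;
# the paper's method uses deep estimators and handles unknown transitions).
import numpy as np

def pqr_tabular(pi, P, gamma, a_anchor):
    """pi: (A, S) expert policy probabilities; P: (A, S, S) known transitions."""
    n_actions, n_states = pi.shape
    log_pi = np.log(pi)  # energy-based policy: log pi(a|s) = Q(s,a) - V(s)

    # Q step: Q(s,a) = log pi(a|s) + V(s), with V unknown.  The anchor action
    # has zero reward, so Q(s, a_anchor) = gamma * E[V(s') | s, a_anchor],
    # giving the linear system (I - gamma * P_anchor) V = -log pi(a_anchor|s).
    P_anchor = P[a_anchor]                                   # (S, S)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_anchor,
                        -log_pi[a_anchor])
    Q = log_pi + V[None, :]                                  # (A, S)

    # R step: invert the soft Bellman equation for every action,
    # r(s,a) = Q(s,a) - gamma * E[V(s') | s, a].
    r = Q - gamma * (P @ V)                                  # (A, S)
    return r, Q, V
```

With known transitions, the anchor condition reduces to a single linear solve for the state values, after which the reward is determined uniquely, which mirrors the identifiability result stated in the abstract.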