The main insight of our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly: we treat the state value function as a random variable whose randomness is determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then take a state-conditional upper expectile of this random variable to estimate the value of the best actions in that state.
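As a concrete illustration, value estimation against an upper expectile can be implemented with an asymmetric squared loss. The sketch below is a minimal PyTorch version; the function name, the `tau` default, and the training context are assumptions, not the paper's exact code.

```python
import torch

def expectile_loss(value: torch.Tensor, target: torch.Tensor,
                   tau: float = 0.9) -> torch.Tensor:
    """Asymmetric L2 loss: tau > 0.5 upweights underestimation errors,
    so the fitted value tends toward an upper expectile of the target."""
    diff = target - value
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

# Regressing V(s) toward sampled Q(s, a) targets with tau near 1
# approximates the value of the best in-distribution actions at s.
```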
Many modern approaches to offline reinforcement learning (RL) use behavior regularization, typically augmenting a model-free actor-critic algorithm with a penalty measuring the divergence of the policy from the offline data.
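One common instantiation of such a penalty is a behavior-cloning term added to the actor loss. The snippet below is a hedged sketch in PyTorch; the weight `alpha`, the names, and the mean-squared form are illustrative choices, not any specific paper's objective.

```python
import torch
import torch.nn.functional as F

def behavior_regularized_actor_loss(q_values: torch.Tensor,
                                    policy_actions: torch.Tensor,
                                    dataset_actions: torch.Tensor,
                                    alpha: float = 2.5) -> torch.Tensor:
    """Maximize the critic's value estimate while penalizing divergence
    of the policy's actions from those observed in the offline data."""
    bc_penalty = F.mse_loss(policy_actions, dataset_actions)
    return -q_values.mean() + alpha * bc_penalty
```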
These are combined with two novel regularization terms for the policy and value function, required to make the use of data augmentation theoretically sound for actor-critic algorithms.
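One natural form such a pair of regularizers can take (an assumption on my part, with illustrative names; the paper's exact terms may differ) keeps the policy's action distribution and the value estimate invariant under augmentations that leave the underlying state unchanged:

```python
import torch
import torch.nn.functional as F
from torch.distributions import kl_divergence

def consistency_regularizers(policy, value_fn, obs, augment):
    """Penalize the policy and value function for disagreeing between
    a clean observation and an augmented view of it."""
    aug_obs = augment(obs)
    with torch.no_grad():
        pi = policy(obs)        # action distribution on the clean view
        v = value_fn(obs)       # value estimate on the clean view
    g_pi = kl_divergence(pi, policy(aug_obs)).mean()   # policy term
    g_v = F.mse_loss(value_fn(aug_obs), v)             # value term
    return g_pi, g_v
```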
In reinforcement learning, it is typical to use the empirically observed transitions and rewards to estimate the value of a policy via either model-based or Q-fitting approaches.
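For the Q-fitting route, a single step amounts to regressing the critic toward an empirical Bellman target built from observed transitions. A minimal sketch, with assumed names and a deterministic policy:

```python
import torch

def bellman_target(reward: torch.Tensor, next_obs: torch.Tensor,
                   done: torch.Tensor, q_net, policy,
                   gamma: float = 0.99) -> torch.Tensor:
    """Empirical Bellman target: observed reward plus the discounted
    value of the policy's action at the observed next state."""
    with torch.no_grad():
        next_action = policy(next_obs)
        return reward + gamma * (1.0 - done) * q_net(next_obs, next_action)

# The critic is then fit by minimizing (Q(s, a) - bellman_target(...))**2
# over transitions (s, a, r, s', done) sampled from the data.
```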
Our agent outperforms baselines specifically designed to improve generalization in RL.
We propose a simple data augmentation technique that can be applied to standard model-free reinforcement learning algorithms, enabling robust learning directly from pixels without the need for auxiliary losses or pre-training.
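A representative augmentation of this kind is a random shift: pad each image observation and crop it back at a random offset. The sketch below is a minimal PyTorch version; the pad size and batching details are assumptions.

```python
import torch
import torch.nn.functional as F

def random_shift(imgs: torch.Tensor, pad: int = 4) -> torch.Tensor:
    """Replicate-pad a batch of (N, C, H, W) images, then crop each
    image back to its original size at a random offset."""
    n, _, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(imgs)
    for i in range(n):
        top = int(torch.randint(0, 2 * pad + 1, (1,)))
        left = int(torch.randint(0, 2 * pad + 1, (1,)))
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

# Applied to observations before each actor/critic update, this serves
# as the augmentation step; no auxiliary loss or pre-training is needed.
```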
In this work, we show how the original distribution ratio estimation objective may be transformed in a principled manner to yield a completely off-policy objective.
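One way such a transformation can look, sketched under assumed notation (not necessarily the paper's exact derivation): start from a convex objective whose minimizer is the density ratio, then reparameterize so the on-policy term telescopes into an expectation over initial states.

```latex
% Objective whose minimizer is w(s,a) = d^{\pi}(s,a) / d^{\mathcal{D}}(s,a):
\min_{x}\;\tfrac{1}{2}\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\!\big[x(s,a)^{2}\big]
  \;-\; \mathbb{E}_{(s,a)\sim d^{\pi}}\!\big[x(s,a)\big]
% The second term needs on-policy samples. Substituting
% x = \nu - \mathcal{B}^{\pi}\nu, where
% (\mathcal{B}^{\pi}\nu)(s,a) = \gamma\,\mathbb{E}_{s'\sim T(\cdot\mid s,a),\,a'\sim\pi}\big[\nu(s',a')\big],
% telescopes it into an initial-state expectation:
\min_{\nu}\;\tfrac{1}{2}\,\mathbb{E}_{d^{\mathcal{D}}}\!\big[(\nu-\mathcal{B}^{\pi}\nu)^{2}\big]
  \;-\;(1-\gamma)\,\mathbb{E}_{s_{0}\sim\beta,\,a_{0}\sim\pi}\!\big[\nu(s_{0},a_{0})\big]
% This is completely off-policy: it uses only dataset transitions and
% initial-state samples; the ratio is recovered as
% w = \nu^{*} - \mathcal{B}^{\pi}\nu^{*}.
```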
In many real-world applications of reinforcement learning (RL), interactions with the environment are limited due to cost or feasibility.
A promising approach is to learn a latent representation together with the control policy.
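A minimal sketch of what learning a latent representation together with the control policy can mean in code: a shared encoder whose features feed the policy head, trained end to end. The architecture and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LatentPolicy(nn.Module):
    """Encoder and policy learned jointly: gradients from the control
    objective shape the latent representation."""
    def __init__(self, latent_dim: int = 50, action_dim: int = 6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )
        self.policy = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.policy(self.encoder(obs))
```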
We identify two issues with the family of algorithms based on the Adversarial Imitation Learning framework.
When Bob is deployed on an RL task within the environment, this unsupervised training reduces the number of supervised episodes needed to learn, and in some cases allows Bob to converge to a higher reward.
Is it possible to build a system to determine the location where a photo was taken using just its pixels?