Offline reinforcement learning (RL) has attracted much attention due to its ability to learn from static offline datasets, eliminating the need to interact with the environment.
In this paper, we study out-of-distribution (OOD) generalization of offline goal-conditioned RL (GCRL) both theoretically and empirically to identify which factors are important.
In this paper, we aim to introduce diversity by allowing agents to hold diverse risk preferences in the face of uncertainty.
To empirically show the advantages of TATU, we first combine it with two classical model-based offline RL algorithms, MOPO and COMBO.
As a framework for sequential decision-making, Reinforcement Learning (RL) has been regarded as an essential component leading to Artificial General Intelligence (AGI).
We validate our insight on a range of RL tasks and show its improvement over baselines: (1) in offline RL, conservative exploitation improves the performance of off-the-shelf algorithms; (2) in online continuous control, multiple value functions with different shifting constants can be used to tackle the exploration-exploitation dilemma for better sample efficiency; (3) in discrete control tasks, negative reward shifting yields an improvement over the curiosity-based exploration method.
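The mechanism behind these results can be made precise with a standard identity (stated here as supporting background, assuming the usual discounted setting): adding a constant $c$ to every reward shifts the value of every policy by the same amount, which acts like a different value-function initialization, optimistic for $c>0$ and conservative for $c<0$:
\[
Q^{\pi}_{r+c}(s,a)\;=\;\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,(r_t+c)\;\middle|\;s_0=s,\,a_0=a\right]\;=\;Q^{\pi}_{r}(s,a)+\frac{c}{1-\gamma}.
\]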
Among the reasons hindering reinforcement learning (RL) applications to real-world problems, two factors are critical: limited data and the mismatch between the testing environment (the real environment in which the policy is deployed) and the training environment (e.g., a simulator).
Keeping risk under control is often more crucial than maximizing expected rewards in real-world decision-making situations, such as finance, robotics, autonomous driving, etc.
The distribution shift between the learned policy and the behavior policy makes it necessary for the value function to stay conservative such that out-of-distribution (OOD) actions will not be severely overestimated.
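As one concrete illustration of such conservatism (not necessarily the formulation used in this work), a CQL-style objective pushes Q-values down on actions sampled from the learned policy $\pi$ and up on actions from the dataset $\mathcal{D}$, on top of the usual Bellman error:
\[
\min_{Q}\;\alpha\Big(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi}\big[Q(s,a)\big]-\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[Q(s,a)\big]\Big)+\tfrac{1}{2}\,\mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(Q(s,a)-\hat{\mathcal{B}}^{\pi}\bar{Q}(s,a)\big)^{2}\Big],
\]
where $\hat{\mathcal{B}}^{\pi}\bar{Q}$ denotes the empirical Bellman backup of a target network.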
Offline reinforcement learning (RL) provides a promising direction to exploit massive amounts of offline data for complex decision-making tasks.
To deal with this unorthodox problem, we introduce a pseudo mean that transforms the intractable MDP into a standard one with a redefined reward function, and we derive a discounted mean-variance performance difference formula.
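A sketch of how a fixed pseudo mean $b$ can fold a variance penalty into a per-step reward (this is the common construction in discounted mean-variance RL with penalty coefficient $\lambda\ge 0$; the exact form used in the paper may differ):
\[
\tilde r(s,a)\;=\;r(s,a)\;-\;\lambda\big(r(s,a)-b\big)^{2},
\]
so that maximizing the standard discounted return under $\tilde r$ trades off the mean reward against its squared deviation from $b$; when $b$ equals the true mean reward, the penalty term recovers a reward-variance penalty.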
Offline reinforcement learning (RL) shows promise for applying RL to real-world problems by effectively utilizing previously collected data.
Head detection in indoor video is an essential component of building occupancy detection.
Our work provides a unified trust-region framework that covers both the discounted and average-reward criteria, which may complement reinforcement learning frameworks beyond the discounted objective.
Moreover, we extend ICQ to multi-agent tasks by decomposing the joint policy under the implicit constraint.
First, we uncover and demonstrate the bias-alleviation property of double actors by building them upon a single critic and upon double critics, to handle the overestimation bias in DDPG and the underestimation bias in TD3, respectively.
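For context on the two biases being addressed (these are the standard targets of DDPG and TD3, not the double-actor construction itself): DDPG bootstraps from a single critic, which tends to overestimate, while TD3's clipped double-Q target takes a minimum over two critics, which tends to underestimate:
\[
y_{\text{DDPG}} = r + \gamma\, Q_{\theta'}\big(s',\mu_{\phi'}(s')\big),
\qquad
y_{\text{TD3}} = r + \gamma \min_{i=1,2} Q_{\theta_i'}\big(s',\mu_{\phi'}(s')+\epsilon\big),
\]
where $\epsilon$ is clipped target-policy smoothing noise.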
Value-based methods for multi-agent reinforcement learning (MARL), especially value decomposition methods, have demonstrated strong performance on a range of challenging cooperative tasks.
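For reference, the simplest value decomposition (VDN-style, shown only to illustrate the family) represents the joint action value as a sum of per-agent utilities, and QMIX relaxes this to any monotonic mixing of them:
\[
Q_{\text{tot}}(\boldsymbol{\tau},\mathbf{a})\;=\;\sum_{i=1}^{n} Q_i(\tau_i,a_i)\quad\text{(VDN)},
\qquad
\frac{\partial Q_{\text{tot}}}{\partial Q_i}\;\ge\;0\;\;\forall i\quad\text{(QMIX monotonicity)}.
\]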
In these tasks, our approach learns a diverse set of options, each with a strongly coherent state-action space.
Generative adversarial imitation learning (GAIL) provides an adversarial learning framework for imitating an expert policy from demonstrations in high-dimensional continuous tasks.
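For reference, the GAIL objective (Ho & Ermon, 2016) trains a discriminator $D$ to separate policy from expert state-action pairs and trains the policy against it with an entropy bonus $H(\pi)$; sign conventions for the surrogate reward vary across implementations:
\[
\min_{\pi}\max_{D}\;\;\mathbb{E}_{\pi}\big[\log D(s,a)\big]\;+\;\mathbb{E}_{\pi_E}\big[\log\big(1-D(s,a)\big)\big]\;-\;\lambda H(\pi),
\]
with the policy typically updated using the surrogate reward $-\log D(s,a)$.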
In this paper, we present a new reinforcement learning (RL) algorithm called Distributional Soft Actor Critic (DSAC), which exploits the distributional information of accumulated rewards to achieve better performance.
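The core quantity such methods model is the return distribution under the maximum-entropy objective; a sketch of the distributional soft Bellman backup this family builds on (DSAC's specific parameterization, e.g. quantiles or a Gaussian, may differ):
\[
Z^{\pi}(s,a)\;\stackrel{D}{=}\;r(s,a)\;+\;\gamma\Big(Z^{\pi}(s',a')\;-\;\alpha\log\pi(a'\mid s')\Big),
\qquad s'\sim P(\cdot\mid s,a),\;\;a'\sim\pi(\cdot\mid s'),
\]
where $\alpha$ is the entropy temperature.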