However, we argue that the n-gram-overlap-based measures currently used as rewards can be improved by model-based rewards transferred from tasks that directly compare the similarity of sentence pairs.
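As a minimal sketch of this contrast, the snippet below scores the same hypothesis with an n-gram-overlap reward (sentence-level BLEU) and with a model-based reward from a sentence-similarity encoder; the encoder name and the use of cosine similarity are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: n-gram overlap reward (BLEU) vs. a model-based similarity
# reward. Model name and cosine similarity are assumed for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util

def ngram_reward(reference: str, hypothesis: str) -> float:
    # Sentence-level BLEU: rewards exact n-gram overlap only.
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], hypothesis.split(),
                         smoothing_function=smooth)

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def model_based_reward(reference: str, hypothesis: str) -> float:
    # Cosine similarity of sentence embeddings: rewards paraphrases
    # that share meaning but little surface n-gram overlap.
    emb = _encoder.encode([reference, hypothesis], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

ref = "the cat sat on the mat"
hyp = "a cat was sitting on the rug"
print(ngram_reward(ref, hyp))        # low: few shared n-grams
print(model_based_reward(ref, hyp))  # higher: semantically close
```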
Unlike existing policy gradient methods, which employ a single actor-critic pair and cannot achieve satisfactory tracking-control accuracy or stable learning, our proposed algorithm achieves high tracking-control accuracy for AUVs and stable learning through a hybrid actors-critics architecture, in which multiple actors and critics are trained to learn a deterministic policy and an action-value function, respectively.
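A minimal PyTorch sketch of such an architecture follows; the network widths, ensemble sizes, and the conservative min-over-critics scoring are illustrative assumptions rather than the paper's exact design.

```python
# Sketch of a hybrid actors-critics architecture: several actors learn a
# deterministic policy and several critics learn the action-value
# function. Widths, ensemble sizes, and min-over-critics scoring are
# assumptions for illustration.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, act_dim), nn.Tanh(),  # actions in [-1, 1]
        )
    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )
    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

obs_dim, act_dim, n_actors, n_critics = 12, 4, 2, 2
actors = [Actor(obs_dim, act_dim) for _ in range(n_actors)]
critics = [Critic(obs_dim, act_dim) for _ in range(n_critics)]

obs = torch.randn(32, obs_dim)
# Each actor proposes actions; the critic ensemble scores them, and a
# conservative (min over critics) estimate selects the acting policy.
with torch.no_grad():
    scores = [torch.stack([c(obs, a(obs)) for c in critics]).min(0).values.mean()
              for a in actors]
best_actor = actors[int(torch.stack(scores).argmax())]
```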
Specifically, we prove that neural natural policy gradient converges to a globally optimal policy at a sublinear rate.
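For concreteness, a sublinear global-convergence guarantee of this kind is typically stated in the form below; the $\mathcal{O}(1/\sqrt{T})$ rate and the approximation-error term are the standard shape of such results in the NPG literature, assumed here as an illustration rather than quoted from the paper's theorem.

```latex
% Illustrative form of a sublinear global-convergence rate for natural
% policy gradient; the rate and error term are assumed for concreteness.
\[
  J(\pi^\ast) - \max_{t \le T} J(\pi_{\theta_t})
  \;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right)
  + \varepsilon_{\mathrm{approx}},
\]
% where $J(\pi)$ is the expected return of policy $\pi$, $\pi^\ast$ is a
% globally optimal policy, $T$ is the number of policy updates, and
% $\varepsilon_{\mathrm{approx}}$ captures function-approximation error.
```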
This can be attributed, at least in part, to the high variance in estimating the gradient of the task objective with Monte Carlo methods.
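The toy experiment below makes this variance concrete: it compares Monte Carlo (REINFORCE-style) gradient estimates with and without a baseline on a one-step Gaussian-policy bandit, which is an assumed setup chosen purely for illustration.

```python
# Toy demonstration of the high variance of Monte Carlo (score-function)
# gradient estimates, and how a baseline reduces it. The one-step
# Gaussian-policy bandit is an assumed setup for illustration.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.0          # Gaussian policy N(mu, sigma^2) over actions

def reward(a):
    return -(a - 2.0) ** 2    # task objective: peak at a = 2

def grad_estimates(n, baseline=0.0):
    a = rng.normal(mu, sigma, size=n)
    score = (a - mu) / sigma**2            # d/d mu of log N(a; mu, sigma)
    return (reward(a) - baseline) * score  # REINFORCE estimator

g = grad_estimates(100_000)
b = grad_estimates(100_000, baseline=reward(mu))
print(f"no baseline:   mean={g.mean():+.3f}, var={g.var():.1f}")
print(f"with baseline: mean={b.mean():+.3f}, var={b.var():.1f}")
# Both estimators are unbiased for the same gradient, but the baseline
# substantially cuts the variance of individual samples.
```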
This paper proposes a definition of system health in the context of multiple agents optimizing a joint reward function.
However, little is known about even their most basic theoretical convergence properties, including: whether and how fast they converge to a globally optimal solution (say, with a sufficiently rich policy class); how they cope with the approximation error introduced by a restricted class of parametric policies; and their finite-sample behavior.
Motivated by the need for an effective deep reinforcement learning algorithm for sparse-reward environments, this paper presents Hindsight Trust Region Policy Optimization (Hindsight TRPO), a method that efficiently utilizes interactions under sparse-reward conditions and maintains learning stability by restricting the variance of the policy update.
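The sketch below illustrates the hindsight-relabeling idea that makes sparse-reward interactions useful: failed trajectories are relabeled with goals that were actually achieved, so zero-reward data yields learning signal. The transition format and binary reward are illustrative assumptions, and the TRPO-style constrained policy update itself is omitted.

```python
# Sketch of hindsight goal relabeling for sparse rewards. Transition
# format and binary reward are assumptions; the TRPO update is omitted.
from dataclasses import dataclass

@dataclass
class Transition:
    state: tuple
    action: int
    achieved: tuple   # state (or goal) actually reached
    goal: tuple       # goal the agent was trying to reach

def sparse_reward(achieved: tuple, goal: tuple) -> float:
    return 1.0 if achieved == goal else 0.0

def relabel(trajectory: list[Transition]) -> list[Transition]:
    """'final' strategy: pretend the last achieved state was the goal."""
    new_goal = trajectory[-1].achieved
    return [Transition(t.state, t.action, t.achieved, new_goal)
            for t in trajectory]

# A failed episode: the agent never reaches the original goal (9, 9).
episode = [Transition((i, 0), 1, (i + 1, 0), (9, 9)) for i in range(5)]
original  = [sparse_reward(t.achieved, t.goal) for t in episode]
relabeled = [sparse_reward(t.achieved, t.goal) for t in relabel(episode)]
print(original)   # [0.0, 0.0, 0.0, 0.0, 0.0]  -> no learning signal
print(relabeled)  # [0.0, 0.0, 0.0, 0.0, 1.0]  -> reward in hindsight
```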
Under an additional strict-saddle assumption, this result establishes convergence to essentially locally optimal policies for the underlying problem, thus bridging a gap in the existing literature on the convergence of PG methods.
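For reference, the strict-saddle property is standardly stated as below (in the maximization convention used for expected return); this is the generic form of the assumption, given for concreteness, and the paper's exact conditions may differ.

```latex
% Standard form of the strict-saddle assumption, stated here for
% concreteness; the paper's exact conditions may differ.
Every first-order stationary point $\theta$ of the objective $J$ is
either a local maximizer or a strict saddle, i.e.,
\[
  \nabla J(\theta) = 0
  \quad\Longrightarrow\quad
  \theta \text{ is a local maximum, or }
  \lambda_{\max}\!\big(\nabla^2 J(\theta)\big) > 0 .
\]
% At a strict saddle, the Hessian has a direction of strictly positive
% curvature, so gradient ascent with random initialization escapes it
% almost surely, upgrading convergence to stationary points into
% convergence to (essentially) locally optimal policies.
```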