10 Dec 2022 • Hsin-En Su, Yen-ju Chen, Ping-Chun Hsieh, Xi Liu
In this paper, we rethink off-policy learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy actor-critic algorithm that decouples policy improvement from the state distribution of the behavior policy without using the policy gradient.
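To make the idea of improving a policy one coordinate at a time more concrete, here is a minimal sketch, not taken from the abstract. It assumes a tabular softmax policy, a critic that supplies action-value estimates, and a sign-of-advantage update rule. The names `coordinate_update`, `theta`, `q_values`, `batch`, and `step_size` are hypothetical and chosen only for illustration. The point it tries to show is that each sampled state-action pair adjusts only its own logit, with no policy-gradient term and no weighting by how often the behavior policy visits that state.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def coordinate_update(theta, q_values, batch, step_size=1.0):
    """Hypothetical coordinate-wise update for a tabular softmax policy.

    Assumption (not from the source text): each (state, action) pair in
    `batch` moves only its own logit, in the direction given by the sign
    of its estimated advantage. No gradient of the return is formed and
    no state-visitation weighting is applied, so the pairs could have
    been collected by any behavior policy.
    """
    pi = softmax(theta)                  # current policy, shape (S, A)
    v = (pi * q_values).sum(axis=1)      # state values under that policy
    adv = q_values - v[:, None]          # advantage estimates
    new_theta = theta.copy()
    for s, a in batch:
        new_theta[s, a] += step_size * np.sign(adv[s, a])
    return new_theta


# Toy problem: 2 states, 3 actions, with made-up critic estimates.
theta = np.zeros((2, 3))
q_values = np.array([[1.0, 0.0, 0.0],
                     [0.0, 0.0, 2.0]])
batch = [(0, 0), (1, 1)]                 # pairs sampled off-policy
new_theta = coordinate_update(theta, q_values, batch)
```

In this toy run, the logit for the better-than-average pair `(0, 0)` is raised and the logit for the worse-than-average pair `(1, 1)` is lowered, while every coordinate outside `batch` is left unchanged. The step size is fixed here purely for simplicity; how it should actually be chosen is not stated in the text above.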