Learning and Planning in Complex Action Spaces

Many important real-world problems have action spaces that are high-dimensional, continuous, or both, making full enumeration of all possible actions infeasible. Instead, only small subsets of actions can be sampled for the purpose of policy evaluation and improvement. In this paper, we propose a general framework to reason in a principled way about policy evaluation and improvement over such sampled action subsets. This sample-based policy iteration framework can in principle be applied to any reinforcement learning algorithm based upon policy iteration. Concretely, we propose Sampled MuZero, an extension of the MuZero algorithm that is able to learn in domains with arbitrarily complex action spaces by planning over sampled actions. We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains: DeepMind Control Suite and Real-World RL Suite.
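
The core idea can be illustrated with a short sketch: rather than evaluating every action, a small set of actions is sampled from the current policy, and policy improvement is computed over that subset only. The sketch below is a simplified illustration under assumed interfaces, not the paper's exact operator (Sampled MuZero plans over the sampled actions with MCTS and accounts for the sampling distribution); `policy`, `q_fn`, and `sampled_policy_improvement` are hypothetical names.

```python
# Simplified sketch of sample-based policy improvement over a sampled action
# subset. `policy` is assumed to expose a sample(state) method and `q_fn` an
# action-value estimate q_fn(state, action); both are illustrative stand-ins,
# not the paper's API.

import numpy as np

def sampled_policy_improvement(policy, q_fn, state, num_samples=20, temperature=1.0):
    # Sample a small subset of actions from the current policy instead of
    # enumerating the (possibly continuous) action space.
    actions = [policy.sample(state) for _ in range(num_samples)]
    # Evaluate each sampled action with the learned value estimate.
    q_values = np.array([q_fn(state, a) for a in actions])
    # Improved policy restricted to the sampled subset (softmax over Q-values).
    logits = q_values / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    # The (action, probability) pairs can serve as policy-improvement targets.
    return actions, probs
```

In Sampled MuZero itself, the sampled actions seed a Monte-Carlo tree search over a learned model, and the resulting search statistics over the sampled subset provide the improved policy target for training.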


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Continuous Control | acrobot.swingup | SMuZero | Return | 417.52 | 1 |
| Continuous Control | ball_in_cup.catch | SMuZero | Return | 977.38 | 1 |
| Continuous Control | cartpole.balance | SMuZero | Return | 984.86 | 1 |
| Continuous Control | cartpole.balance_sparse | SMuZero | Return | 998.14 | 1 |
| Continuous Control | cartpole.swingup | SMuZero | Return | 868.87 | 1 |
| Continuous Control | cartpole.swingup_sparse | SMuZero | Return | 846.91 | 1 |
| Continuous Control | cheetah.run | SMuZero | Return | 914.39 | 1 |
| Continuous Control | finger.spin | SMuZero | Return | 986.38 | 1 |
| Continuous Control | finger.turn_easy | SMuZero | Return | 972.53 | 1 |
| Continuous Control | finger.turn_hard | SMuZero | Return | 963.07 | 1 |
| Continuous Control | hopper.hop | SMuZero | Return | 528.24 | 1 |
| Continuous Control | hopper.stand | SMuZero | Return | 926.5 | 1 |
| Continuous Control | pendulum.swingup | SMuZero | Return | 837.76 | 1 |
| Continuous Control | quadruped.run | SMuZero | Return | 923.54 | 1 |
| Continuous Control | quadruped.walk | SMuZero | Return | 933.77 | 1 |
| Continuous Control | reacher.easy | SMuZero | Return | 982.26 | 1 |
| Continuous Control | reacher.hard | SMuZero | Return | 971.53 | 1 |
| Continuous Control | walker.run | SMuZero | Return | 931.06 | 1 |
| Continuous Control | walker.stand | SMuZero | Return | 987.79 | 1 |
| Continuous Control | walker.walk | SMuZero | Return | 975.46 | 1 |
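
The Dataset column lists domain.task pairs from the DeepMind Control Suite, where Return is the undiscounted episode return (control suite episodes default to 1000 steps with per-step rewards in [0, 1], so the maximum return is 1000). Below is a minimal sketch of measuring episode return on one of these tasks, assuming the dm_control package is installed and using a random-action agent as a stand-in for SMuZero.

```python
# Minimal sketch: evaluate one episode's return on a DeepMind Control Suite
# task. A random-action agent stands in for the learned SMuZero policy.

import numpy as np
from dm_control import suite

env = suite.load(domain_name="cartpole", task_name="swingup")
spec = env.action_spec()

time_step = env.reset()
episode_return = 0.0
while not time_step.last():
    # Sample a random action within the bounded action spec.
    action = np.random.uniform(spec.minimum, spec.maximum, size=spec.shape)
    time_step = env.step(action)
    episode_return += time_step.reward

print(f"Episode return: {episode_return:.2f}")
```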
