Exploration by Random Network Distillation

We introduce an exploration bonus for deep reinforcement learning methods that is easy to implement and adds minimal overhead to the computation performed. The bonus is the error of a neural network predicting features of the observations given by a fixed randomly initialized neural network. We also introduce a method to flexibly combine intrinsic and extrinsic rewards. We find that the random network distillation (RND) bonus combined with this increased flexibility enables significant progress on several hard exploration Atari games. In particular we establish state of the art performance on Montezuma's Revenge, a game famously difficult for deep reinforcement learning methods. To the best of our knowledge, this is the first method that achieves better than average human performance on this game without using demonstrations or having access to the underlying state of the game, and occasionally completes the first level.

Published at ICLR 2019.
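As a rough illustration of the mechanism described in the abstract, the sketch below implements an RND-style bonus in PyTorch. This is not the authors' code: the MLP architecture, feature size, optimizer settings, and the flat observation format are illustrative assumptions, and the observation/reward normalization and the intrinsic–extrinsic combination scheme used in the paper are omitted.

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Sketch of a Random Network Distillation bonus: the exploration reward is
    the prediction error of a trained predictor network against a fixed,
    randomly initialized target network. Sizes and layers are illustrative."""

    def __init__(self, obs_dim: int, feat_dim: int = 128):
        super().__init__()

        def mlp() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(obs_dim, 256), nn.ReLU(),
                nn.Linear(256, feat_dim),
            )

        self.target = mlp()     # fixed random feature network (never trained)
        self.predictor = mlp()  # trained to imitate the target's outputs
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=1e-4)

    @torch.no_grad()
    def bonus(self, obs: torch.Tensor) -> torch.Tensor:
        # Intrinsic reward: per-observation MSE between predictor and target features.
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)

    def update(self, obs: torch.Tensor) -> float:
        # Train the predictor only on observations the agent actually visited,
        # so the bonus decays for familiar states and stays high for novel ones.
        loss = (self.predictor(obs) - self.target(obs)).pow(2).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()
```

In use, one would add a scaled `bonus(obs)` to the environment reward and periodically call `update(obs)` on batches of visited observations. Because the target network is fixed and random, the predictor's error shrinks on frequently seen observations and remains large on novel ones, which is what makes the prediction error usable as an exploration bonus; the paper's actual reward combination (separate value heads and reward normalization) is more involved than this sketch.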
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Atari Games | Atari 2600 Gravitar | RND | Score | 3906 | #11 |
| Atari Games | Atari 2600 Montezuma's Revenge | RND | Score | 8152 | #5 |
| Atari Games | Atari 2600 Pitfall! | RND | Score | -3 | #20 |
| Atari Games | Atari 2600 Private Eye | RND | Score | 8666 | #11 |
| Atari Games | Atari 2600 Solaris | RND | Score | 3282 | #17 |
| Atari Games | Atari 2600 Venture | RND | Score | 1859 | #9 |
| Unsupervised Reinforcement Learning | URLB (pixels, 10^5 frames) | APT | Walker (mean normalized return) | 7.71±7.39 | #7 |
| Unsupervised Reinforcement Learning | URLB (pixels, 10^5 frames) | APT | Quadruped (mean normalized return) | 21.22±5.14 | #7 |
| Unsupervised Reinforcement Learning | URLB (pixels, 10^5 frames) | APT | Jaco (mean normalized return) | 0.37±0.64 | #9 |
| Unsupervised Reinforcement Learning | URLB (pixels, 10^5 frames) | RND | Walker (mean normalized return) | 23.87±10.21 | #3 |
| Unsupervised Reinforcement Learning | URLB (pixels, 10^5 frames) | RND | Quadruped (mean normalized return) | 24.37±8.70 | #4 |
| Unsupervised Reinforcement Learning | URLB (pixels, 10^5 frames) | RND | Jaco (mean normalized return) | 26.22±4.83 | #1 |
| Unsupervised Reinforcement Learning | URLB (pixels, 10^6 frames) | RND | Walker (mean normalized return) | 30.46±14.18 | #4 |
| Unsupervised Reinforcement Learning | URLB (pixels, 10^6 frames) | RND | Quadruped (mean normalized return) | 41.89±11.72 | #1 |
| Unsupervised Reinforcement Learning | URLB (pixels, 10^6 frames) | RND | Jaco (mean normalized return) | 24.38±3.92 | #3 |
| Unsupervised Reinforcement Learning | URLB (pixels, 2*10^6 frames) | RND | Walker (mean normalized return) | 32.80±13.19 | #3 |
| Unsupervised Reinforcement Learning | URLB (pixels, 2*10^6 frames) | RND | Quadruped (mean normalized return) | 42.57±11.65 | #2 |
| Unsupervised Reinforcement Learning | URLB (pixels, 2*10^6 frames) | RND | Jaco (mean normalized return) | 27.51±7.12 | #4 |
| Unsupervised Reinforcement Learning | URLB (pixels, 5*10^5 frames) | RND | Walker (mean normalized return) | 25.44±9.92 | #4 |
| Unsupervised Reinforcement Learning | URLB (pixels, 5*10^5 frames) | RND | Quadruped (mean normalized return) | 36.02±10.27 | #1 |
| Unsupervised Reinforcement Learning | URLB (pixels, 5*10^5 frames) | RND | Jaco (mean normalized return) | 26.62±2.75 | #2 |
| Unsupervised Reinforcement Learning | URLB (states, 10^5 frames) | RND | Walker (mean normalized return) | 82.57±31.22 | #2 |
| Unsupervised Reinforcement Learning | URLB (states, 10^5 frames) | RND | Quadruped (mean normalized return) | 35.34±11.16 | #3 |
| Unsupervised Reinforcement Learning | URLB (states, 10^5 frames) | RND | Jaco (mean normalized return) | 72.84±6.87 | #2 |
| Unsupervised Reinforcement Learning | URLB (states, 10^6 frames) | RND | Walker (mean normalized return) | 84.93±29.64 | #1 |
| Unsupervised Reinforcement Learning | URLB (states, 10^6 frames) | RND | Quadruped (mean normalized return) | 69.12±11.95 | #2 |
| Unsupervised Reinforcement Learning | URLB (states, 10^6 frames) | RND | Jaco (mean normalized return) | 60.68±8.49 | #4 |
| Unsupervised Reinforcement Learning | URLB (states, 2*10^6 frames) | RND | Walker (mean normalized return) | 79.28±30.91 | #1 |
| Unsupervised Reinforcement Learning | URLB (states, 2*10^6 frames) | RND | Quadruped (mean normalized return) | 75.14±16.23 | #2 |
| Unsupervised Reinforcement Learning | URLB (states, 2*10^6 frames) | RND | Jaco (mean normalized return) | 56.05±8.73 | #4 |
| Unsupervised Reinforcement Learning | URLB (states, 5*10^5 frames) | RND | Walker (mean normalized return) | 87.15±27.65 | #1 |
| Unsupervised Reinforcement Learning | URLB (states, 5*10^5 frames) | RND | Quadruped (mean normalized return) | 59.90±12.95 | #1 |
| Unsupervised Reinforcement Learning | URLB (states, 5*10^5 frames) | RND | Jaco (mean normalized return) | 65.08±5.45 | #3 |

Methods


No methods listed for this paper.