We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
Genetic algorithms constitute a family of black-box optimization algorithms, which take inspiration from the principles of biological evolution.
Such applications often require to put constraints on the agent's behavior.
Optimizing functions without access to gradients is the remit of black-box methods such as evolution strategies.
By formally organising these modifications into several factors of variation, we are able to show that Analyses of Variance (ANOVA) are a potent tool for studying the effects of human-relevant domain changes on the learning and transfer performance of a deep reinforcement learning agent.
We support these results with a qualitative analysis of resulting meta-parameter schedules and learned functions of context features.
Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations.
We show that a recent successful meta RL approach that meta-learns an objective for backpropagation-based learning exhibits certain symmetries (specifically the reuse of the learning rule, and invariance to input and output permutations) that are not present in typical black-box meta RL systems.
We achieve a new state-of-the art for model-free agents on the Atari ALE benchmark and demonstrate that it yields both performance and efficiency gains in multi-task meta-learning.
We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features, while assuring that they are near optimal.
no code implementations • 5 Oct 2020 • Sebastian Flennerhag, Jane X. Wang, Pablo Sprechmann, Francesco Visin, Alexandre Galashov, Steven Kapturowski, Diana L. Borsa, Nicolas Heess, Andre Barreto, Razvan Pascanu
Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agent's temporal difference uncertainties.
The encoder transforms market-specific data into an abstract latent representation that is processed by a global model shared by all markets, while the decoder learns a market-specific trading strategy based on both local and global information from the market-specific encoder and the global model.
On the other hand, approaches that try to control a gradient-based update rule typically resort to computing gradients through the learning process to obtain their meta-gradients, leading to methods that can not scale beyond few-shot task adaptation.
We empirically show the superiority of this approach over conventional ensemble learning approaches and rivaling spatial data augmentation methods, using synthetic and real-world prediction tasks.
Approaches that transfer information contained only in the final parameters of a source model will therefore struggle.
Standard neural network architectures are non-linear only by virtue of a simple element-wise activation function, making them both brittle and excessively large.