Hex is a turn-based two-player connection game with a high branching factor, making the game arbitrarily complex with increasing board sizes.
We present Nonparametric Approximation of Inter-Trace returns (NAIT), a Reinforcement Learning algorithm for discrete action, pixel-based environments that is both highly sample and computation efficient.
Ranked #6 on Atari Games 100k on Atari 100k
We consider multi-agent reinforcement learning (MARL) for cooperative communication and coordination tasks.
Enabling reinforcement learning (RL) agents to leverage a knowledge base while learning from experience promises to advance RL in knowledge intensive domains.
This paper introduces Multi-Agent MDP Homomorphic Networks, a class of networks that allows distributed execution using only local information, yet is able to share experience between global symmetries in the joint state-action space of cooperative multi-agent systems.
Combining the benefits of planning and learning values, we propose the Value Refinement Network (VRN), an architecture that locally refines a plan in a (simpler) state space abstraction, represented by a pre-computed value function, with respect to the full agent state.
In simulated robotic navigation tasks, VI-RL results in consistent strong improvement over vanilla RL, is on par with vanilla hierarchal RL on single layouts but more broadly applicable to multiple layouts, and is on par with trainable HL path planning baselines except for a parking task with difficult non-holonomic dynamics where it shows marked improvements.
Exploration is an essential component of reinforcement learning algorithms, where agents need to learn how to predict and control unknown and often stochastic environments.
Reinforcement learning is a promising paradigm for solving sequential decision-making problems, but low data efficiency and weak generalization across tasks are bottlenecks in real-world applications.
In policy search methods for reinforcement learning (RL), exploration is often performed by injecting noise either in action space at each step independently or in parameter space over each full trajectory.
In today's clinical practice, magnetic resonance imaging (MRI) is routinely accelerated through subsampling of the associated Fourier domain.
Neural processes (NPs) constitute a family of variational approximate models for stochastic processes with promising properties in computational efficiency and uncertainty quantification.
Air traffic control is becoming a more and more complex task due to the increasing number of aircraft.
MDP homomorphic networks are neural networks that are equivariant under symmetries in the joint state-action space of an MDP.
In contrast to self-empowerment, a robot employing our approach strives for the empowerment of people in its environment, so they are not disturbed by the robot's presence and motion.
We derive an unbiased estimator for expectations over discrete random variables based on sampling without replacement, which reduces variance as it avoids duplicate samples.
PAC Bayes is a generalized framework which is more resistant to overfitting and that yields performance bounds that hold with arbitrarily high probability even on the unjustified extrapolations.
In this paper, we present a search algorithm that generates efficient trajectories that optimize the rate at which probability mass is covered by a searcher.
We show how to implicitly apply this 'Gumbel-Top-$k$' trick on a factorized distribution over sequences, allowing to draw exact samples without replacement using a Stochastic Beam Search.
Learned controllers such as neural networks typically do not have a notion of uncertainty that allows to diagnose an offset between training and testing conditions, and potentially intervene.
In this work, we propose a novel method for training neural networks to perform single-document extractive summarization without heuristically-generated extractive labels.
Ranked #9 on Extractive Text Summarization on CNN / Daily Mail
In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies.
A naive application of unsupervised dimensionality reduction methods to the context variables, such as principal component analysis, is insufficient as task-relevant input may be ignored.