Incorporating prior knowledge into reinforcement learning algorithms remains largely an open question.
We propose to learn to distinguish reversible from irreversible actions for better-informed decision-making in Reinforcement Learning (RL).
Sparse rewards are double-edged training signals in reinforcement learning: easy to design but hard to optimize.
Despite their clear success on deep reinforcement learning problems, actor-critic algorithms still suffer from sample inefficiency in complex environments, particularly in tasks where efficient exploration is a bottleneck.
In this paper, we propose a reinforcement learning approach to solve a realistic scheduling problem, and apply it to an algorithm commonly executed in the high-performance computing community, the Cholesky factorization.
We prove the theoretical consistency of the new gradient estimator and observe dramatic empirical improvement across a variety of continuous control tasks and algorithms.
To do so, we cast the speaker recognition task into a sequential decision-making problem that we solve with Reinforcement Learning.
Violating constraints thus results in rejected actions or in entering a safe mode driven by an external controller, making RL agents incapable of learning from their mistakes.
In this paper: (a) We introduce and define MERL, the multi-head reinforcement learning framework we use throughout this work.
In this work, Vex is used to evaluate the impact each transition will have on learning: this criterion refines sampling and improves the policy gradient algorithm.
In this work, we use this metric to select samples that are useful to learn from, and we demonstrate that this selection can significantly improve the performance of policy gradient methods.
Recent breakthroughs in computer vision and natural language processing have spurred interest in challenging multi-modal tasks such as visual question answering and visual dialogue.
The objectives of recommender systems can be broadly characterized as modeling user preferences over short- or long-term time horizons.
In this paper, we consider the problems of supervised classification and regression in the case where attributes and labels are functions: each data point is represented by a set of functions, and the label is also a function.
Then, we propose a generative model of software dependency graphs which synthesizes graphs whose degree distributions are close to the empirical ones observed in real software systems.
Finally, we evaluate the performance of our KDE approach using both covariance and conditional covariance kernels on two structured output problems, and compare it to the state-of-the-art kernel-based structured output regression methods.