
The goal of Q-learning is to learn a policy, which tells an agent what action to take under what circumstances.
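
To make the idea concrete, below is a minimal tabular Q-learning sketch. The environment interface (`env.reset()`, `env.step(action)`) and all hyperparameters are illustrative assumptions, not taken from any of the papers listed here.

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative)."""
    Q = defaultdict(lambda: [0.0] * n_actions)  # Q[state][action]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore with probability eps, otherwise act greedily
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2, r, done = env.step(a)  # assumed interface: (next_state, reward, done)
            # Off-policy update: bootstrap from the greedy value at the next state
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```

The learned policy is then read off greedily, taking `argmax` over `Q[s]` in each state.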


This paper presents the ModelicaGym toolbox, developed to employ reinforcement learning (RL) for solving optimization and control tasks in Modelica models.

To achieve the above goal, we employ reinforcement learning, and in particular Deep Q-learning (DQN), to learn optimal push policies by trial and error.
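
As a rough sketch of what a deep Q-learning update looks like (a generic DQN step, not this paper's actual push-policy implementation), assuming PyTorch, a small fully connected network, and transitions sampled from a replay buffer:

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small Q-network mapping observations to one value per discrete action."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, x):
        return self.net(x)

def dqn_step(qnet, target_net, optimizer, batch, gamma=0.99):
    # batch: float tensors except `act` (int64); `done` is a 0/1 float mask.
    obs, act, rew, next_obs, done = batch
    q = qnet(obs).gather(1, act.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap from a slowly updated target network for stability
        target = rew + gamma * (1 - done) * target_net(next_obs).max(1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```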

The algorithm is of the gradient type (and therefore has good convergence properties even when used in conjunction with function approximators such as neural networks); it is off-policy; and it specifies both the update equations and the strategy to address the exploration-exploitation dilemma.
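
The abstract leaves the concrete exploration strategy to the paper itself; one common illustrative choice for trading off exploration and exploitation is a linearly annealed epsilon-greedy schedule:

```python
def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps.

    Early training explores almost uniformly; later training mostly exploits.
    All constants here are illustrative, not the paper's.
    """
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```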

As more and more application providers transition to the cloud and deliver their services on a Software as a Service (SaaS) basis, cloud providers need to make their provisioning systems agile enough to deliver on Service Level Agreements.

The experiments show that learning high-level knowledge in the form of reward machines can lead to fast convergence to optimal policies in RL, while standard RL methods such as Q-learning and hierarchical RL methods fail to converge to optimal policies even after a substantial number of training steps in many tasks.

We introduce a new problem named "grasping the invisible", where a robot is tasked to grasp an initially invisible target object via a sequence of non-prehensile (e.g., pushing) and prehensile (e.g., grasping) actions.

While this was initially proposed for Markov Decision Processes (MDPs) in tabular settings, it was recently shown that a similar principle leads to significant improvements over vanilla SQL in RL for high-dimensional domains with discrete actions and function approximators.

Motivated by the widespread use of temporal-difference (TD) and Q-learning algorithms in reinforcement learning, this paper studies a class of biased stochastic approximation (SA) procedures under a mild "ergodic-like" assumption on the underlying stochastic noise sequence.
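
For intuition, linear TD(0) is a canonical example of such a biased SA procedure: the updates are driven by successive states of a single Markov chain rather than i.i.d. noise. A hypothetical sketch, where the feature map `phi` and the transition sampler are assumed interfaces:

```python
import numpy as np

def td0_linear(sample_transition, phi, dim, gamma=0.99, steps=10_000):
    """SA iteration theta_{k+1} = theta_k + alpha_k * g(theta_k, xi_k),
    where xi_k = (s, r, s') comes from a Markov chain (Markovian, non-i.i.d. noise).
    """
    theta = np.zeros(dim)
    for k in range(1, steps + 1):
        s, r, s2 = sample_transition()           # one step of the underlying chain
        delta = r + gamma * phi(s2) @ theta - phi(s) @ theta  # TD error
        theta += (1.0 / k) * delta * phi(s)      # diminishing step size alpha_k = 1/k
    return theta
```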

In this paper, we use an aerial base station (aerial-BS) to enhance fairness in a dynamic environment with user mobility.

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a $\textit{fixed}$ number of future time steps.
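
A minimal tabular sketch of the fixed-horizon idea, under an assumed integer-state environment with a `sample_action()` helper (both hypothetical): keep one value table per horizon $h$, where `V[h][s]` estimates the expected sum of the next $h$ rewards, and bootstrap each table from the one below it.

```python
import numpy as np

def fixed_horizon_td(env, n_states, H=5, alpha=0.1, episodes=500):
    V = np.zeros((H + 1, n_states))  # V[0] is identically zero by definition
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s2, r, done = env.step(env.sample_action())  # assumed interface
            for h in range(1, H + 1):
                # The h-step value bootstraps from the (h-1)-step value at s'
                target = r + (0.0 if done else V[h - 1][s2])
                V[h][s] += alpha * (target - V[h][s])
            s = s2
    return V
```

Note that the bootstrap target `V[h-1][s2]` never refers to the table currently being updated, which distinguishes these targets from standard TD bootstrapping.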