no code implementations • 2 Feb 2024 • Prashansa Panda, Shalabh Bhatnagar
In recent years, there has been considerable research activity on asymptotic and non-asymptotic convergence analyses of two-timescale actor-critic algorithms, where the actor updates are performed on a slower timescale than the critic updates.
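The two-timescale structure is enforced through the step-size schedules: both the actor and critic gains satisfy the usual stochastic-approximation conditions, but the actor's steps become negligible relative to the critic's. A minimal illustration (the specific schedules below are common textbook choices, not taken from the paper):

```python
import numpy as np

# Illustrative two-timescale step-size schedules: both satisfy
# sum(a_k) = inf and sum(a_k^2) < inf, and the actor's steps beta_k
# are asymptotically negligible relative to the critic's alpha_k,
# i.e. beta_k / alpha_k -> 0.
k = np.arange(1, 100001, dtype=float)
alpha = 1.0 / k ** 0.6          # critic: faster timescale
beta = 1.0 / k                  # actor: slower timescale
ratio = beta / alpha            # equals k ** -0.4, which tends to 0
```

The ratio tending to zero is what lets the analysis treat the critic as having essentially converged between consecutive actor updates.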
no code implementations • 20 Nov 2023 • Lakshmi Mandal, Chandrashekar Lakshminarayanan, Shalabh Bhatnagar
In this work, we consider a `cooperative' multi-agent Markov decision process (MDP) involving m > 1 agents, where all agents are aware of the system model.
no code implementations • 25 Oct 2023 • Prashansa Panda, Shalabh Bhatnagar
Actor-critic methods have found immense application across a wide range of Reinforcement Learning tasks, especially when the state-action space is large.
no code implementations • 9 Oct 2023 • Arghyadeep Barat, Prabuchandran K. J., Shalabh Bhatnagar
In this paper, we consider the problem of finding an optimal energy management policy for a network of sensor nodes capable of harvesting their own energy and sharing it with other nodes in the network.
no code implementations • 8 Oct 2023 • Shalabh Bhatnagar
This has advantages in the case of systems with infinite state and action spaces, as it relaxes some of the regularity requirements that would otherwise be needed to prove convergence of the Reinforce algorithm.
no code implementations • 20 May 2023 • Arunselvan Ramaswamy, Shalabh Bhatnagar, Naman Saxena
We show, in theory and through experiments, that our algorithm updates have low variance, and the training loss reduces in a smooth manner.
no code implementations • 20 May 2023 • Naman Saxena, Subhojyoti Khastigir, Shishir Kolathaya, Shalabh Bhatnagar
In this work, we present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion.
no code implementations • 21 Apr 2023 • Mizhaan Prajit Maniyar, Akash Mondal, Prashanth L. A., Shalabh Bhatnagar
We consider the problem of control in the setting of reinforcement learning (RL), where model information is not available.
1 code implementation • 13 Mar 2023 • Lakshmi Mandal, Shalabh Bhatnagar
We consider the problem of finding the optimal value of n in the n-step temporal difference (TD) learning algorithm.
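The optimal n trades off bias against variance and is problem-dependent, which is what makes searching for it worthwhile. As background, a minimal sketch of the standard tabular n-step TD prediction update itself (the toy random-walk environment and all names here are illustrative, not from the paper):

```python
import numpy as np

def n_step_td(n, episodes=3000, alpha=0.1, gamma=1.0, seed=0):
    # Tabular n-step TD prediction on a 7-state random walk:
    # states 1..5 are non-terminal, 0 and 6 are terminal, and the
    # agent gets reward +1 only on reaching state 6.  True values of
    # states 1..5 are 1/6 .. 5/6.  Illustrative environment only.
    rng = np.random.default_rng(seed)
    V = np.zeros(7)
    for _ in range(episodes):
        states = [3]                 # S_0: start in the middle
        rewards = [0.0]              # placeholder so rewards[t] = R_t
        T, t = np.inf, 0
        while True:
            if t < T:
                s2 = states[t] + rng.choice([-1, 1])
                states.append(s2)
                rewards.append(1.0 if s2 == 6 else 0.0)
                if s2 in (0, 6):
                    T = t + 1        # episode terminates at time T
            tau = t - n + 1          # time whose estimate is updated
            if tau >= 0:
                # n-step return: discounted rewards plus a bootstrap
                # from V at the state n steps ahead (if non-terminal).
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

With small n the update bootstraps heavily (low variance, more bias); with large n it approaches Monte Carlo returns.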
no code implementations • 20 Dec 2022 • Soumen Pachal, Shalabh Bhatnagar, L. A. Prashanth
We first present in detail unbalanced generalized simultaneous perturbation stochastic approximation (GSPSA) estimators and later present the balanced versions (B-GSPSA) of these.
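For context, the classic two-sided SPSA estimator that these schemes generalize perturbs all coordinates simultaneously with a single random direction, so a gradient estimate costs two function evaluations regardless of dimension. A minimal sketch (not the paper's GSPSA estimators; the gain sequences and names are illustrative assumptions):

```python
import numpy as np

def spsa_gradient(f, x, delta, rng):
    # Classic two-sided SPSA gradient estimate with a Rademacher
    # perturbation direction d:
    #   g_i = (f(x + delta*d) - f(x - delta*d)) / (2 * delta * d_i)
    d = rng.choice([-1.0, 1.0], size=x.shape)
    return (f(x + delta * d) - f(x - delta * d)) / (2.0 * delta * d)

def spsa_minimize(f, x0, iters=2000, seed=0):
    # Plain stochastic gradient descent driven by SPSA estimates,
    # with decaying step sizes and perturbation widths.
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    for k in range(iters):
        a_k = 0.5 / (k + 10)             # step size
        c_k = 0.1 / (k + 1) ** 0.25      # perturbation width
        x -= a_k * spsa_gradient(f, x, c_k, rng)
    return x
```

Only two evaluations of f per iteration are needed, which is the key practical advantage over coordinate-wise finite differences.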
1 code implementation • 14 Oct 2022 • Ashish Kumar Jayant, Shalabh Bhatnagar
We compare our approach with relevant model-free and model-based approaches in Constrained RL using the challenging Safe Reinforcement Learning benchmark, OpenAI Safety Gym.
no code implementations • 10 Oct 2022 • Shalabh Bhatnagar, Vivek S. Borkar, Soumyajit Guin
We revisit the standard formulation of the tabular actor-critic algorithm as a two-time-scale stochastic approximation, with the value function computed on a faster time-scale and the policy on a slower time-scale.
1 code implementation • 10 Oct 2022 • Soumyajit Guin, Shalabh Bhatnagar
In many situations, finite horizon control problems are of interest and for such problems, the optimal policies are time-varying in general.
no code implementations • 30 Jul 2022 • Akash Mondal, Prashanth L. A., Shalabh Bhatnagar
In this paper, we present a stochastic gradient algorithm for minimizing a smooth objective function that is an expectation over noisy cost samples, and only the latter are observed for any given parameter.
no code implementations • 2 Jan 2022 • Arun Raman, Keerthan Shagrithaya, Shalabh Bhatnagar
We assume that the set of action sequences deemed unsafe and/or safe is given in terms of a finite-state automaton, and propose a supervisor that disables a subset of actions at every state of the MDP so that the constraints on action sequences are satisfied.
no code implementations • 7 Dec 2021 • Rohan Deb, Shalabh Bhatnagar
This paper presents the first sufficient conditions that guarantee the stability and almost sure convergence of $N$-timescale stochastic approximation (SA) iterates for any $N\geq1$.
no code implementations • 23 Nov 2021 • Rohan Deb, Meet Gandhi, Shalabh Bhatnagar
However, the weights assigned to different $n$-step returns in TD($\lambda$), controlled by the parameter $\lambda$, decrease exponentially with increasing $n$.
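The weight pattern is easy to see numerically: TD($\lambda$) places weight $(1-\lambda)\lambda^{n-1}$ on the $n$-step return, a geometric series summing to one. A short sketch of just that weighting:

```python
import numpy as np

# Weight placed by TD(lambda) on the n-step return: (1 - lam) * lam**(n-1).
lam = 0.9
n = np.arange(1, 200)
w = (1 - lam) * lam ** (n - 1)
# The weights decay geometrically in n and sum to (almost) 1, so long
# n-step returns contribute very little for moderate values of lam.
```

Even at $\lambda = 0.9$, the 50-step return receives under 1% of the weight of the 1-step return, which is the exponential decay the sentence above refers to.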
no code implementations • 22 Nov 2021 • Rohan Deb, Shalabh Bhatnagar
Here, we consider Gradient TD algorithms with an additional heavy-ball momentum term and provide choices of the step size and momentum parameter that ensure almost sure asymptotic convergence of these algorithms.
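The heavy-ball term adds a fraction of the previous displacement to each update. A minimal sketch of the iteration on a deterministic gradient (the paper attaches this momentum term to stochastic Gradient TD updates; the quadratic test function and parameter values below are illustrative assumptions):

```python
import numpy as np

def heavy_ball(grad, x0, step=0.05, momentum=0.9, iters=500):
    # Heavy-ball (Polyak momentum) iteration:
    #   x_{k+1} = x_k - step * grad(x_k) + momentum * (x_k - x_{k-1})
    # The momentum term damps oscillations along ill-conditioned
    # directions and can accelerate convergence.
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(iters):
        x_next = x - step * grad(x) + momentum * (x - x_prev)
        x_prev, x = x, x_next
    return x
```

In the stochastic setting analyzed in the paper, the step size and momentum parameter must additionally be chosen to control the noise, which is where the convergence conditions come in.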
no code implementations • 19 Oct 2021 • Raghuram Bharadwaj Diddigi, Prateek Jain, Prabuchandran K. J., Shalabh Bhatnagar
Learning optimal behavior from existing data is one of the most important problems in Reinforcement Learning (RL).
2 code implementations • 7 Jan 2021 • P. Parnika, Raghuram Bharadwaj Diddigi, Sai Koti Reddy Danda, Shalabh Bhatnagar
In this work, we consider the problem of computing optimal actions for Reinforcement Learning (RL) agents in a co-operative setting, where the objective is to optimize a common goal.
1 code implementation • 30 Oct 2020 • Kartik Paigwar, Lokesh Krishna, Sashank Tirumala, Naman Khetan, Aditya Sagi, Ashish Joglekar, Shalabh Bhatnagar, Ashitava Ghosal, Bharadwaj Amrutur, Shishir Kolathaya
In particular, the parameters of the end-foot trajectories are shaped via a linear feedback policy that takes the torso orientation and the terrain slope as inputs.
no code implementations • 9 Oct 2020 • Dhuruva Priyan G M, Abhik Singla, Shalabh Bhatnagar
Natural gradients address these challenges by improving the convergence of the model parameters.
no code implementations • 2 Sep 2020 • Meet Gandhi, Atreyee Kundu, Shalabh Bhatnagar
Second, we model a set of benchmark examples of hybrid control design problem in the proposed MDP framework.
no code implementations • 28 Jul 2020 • Sashank Tirumala, Sagar Gubbi, Kartik Paigwar, Aditya Sagi, Ashish Joglekar, Shalabh Bhatnagar, Ashitava Ghosal, Bharadwaj Amrutur, Shishir Kolathaya
First, multiple simpler policies are trained to generate trajectories for a discrete set of target velocities and turning radii.
1 code implementation • 6 Feb 2020 • Shravan Nayak, Chanakya Ajit Ekbote, Annanya Pratap Singh Chauhan, Raghuram Bharadwaj Diddigi, Prishita Ray, Abhinava Sikdar, Sai Koti Reddy Danda, Shalabh Bhatnagar
A microgrid is capable of generating a limited amount of energy from a renewable resource and is responsible for handling the demands of its dedicated customers.
no code implementations • 20 Nov 2019 • Akshay Dharmavaram, Matthew Riemer, Shalabh Bhatnagar
Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long-term credit assignment by leveraging temporal abstractions.
1 code implementation • 13 Nov 2019 • Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar
In this work, we propose a convergent on-line off-policy TD algorithm under linear function approximation.
1 code implementation • 1 Nov 2019 • Indu John, Chandramouli Kamanchi, Shalabh Bhatnagar
In most RL algorithms such as Q-learning, the Bellman equation and the Bellman operator play an important role.
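The role of the Bellman operator is most visible in the tabular Q-learning update, which is a stochastic-approximation version of the Bellman optimality operator applied to sampled transitions. A minimal sketch (state/action indices and parameter values are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # One tabular Q-learning step:
    #   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    # The bracketed term is the sampled Bellman optimality backup minus
    # the current estimate (the temporal-difference error).
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q
```

Fixed points of the expected update are exactly the solutions of the Bellman optimality equation, which is why the operator's properties drive the algorithm's analysis.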
no code implementations • 16 Jun 2019 • Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar
This problem is formulated as a min-max Markov game in the literature.
no code implementations • 15 May 2019 • Shounak Bhattacharya, Abhik Singla, Abhimanyu, Dhaivat Dholakiya, Shalabh Bhatnagar, Bharadwaj Amrutur, Ashitava Ghosal, Shishir Kolathaya
In this work, we provide a simulation framework to perform systematic studies on the effects of spinal joint compliance and actuation on bounding performance of a 16-DOF quadruped spined robot Stoch 2.
no code implementations • 10 May 2019 • Sindhu Padakandla, Prabuchandran K. J, Shalabh Bhatnagar
In this paper, we thus consider the problem of developing RL methods that obtain optimal decisions in a non-stationary environment.
2 code implementations • 10 May 2019 • Chandramouli Kamanchi, Raghuram Bharadwaj Diddigi, Shalabh Bhatnagar
In this work, we propose a second order value iteration procedure that is obtained by applying the Newton-Raphson method to the successive relaxation value iteration scheme.
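As background, the successive relaxation value iteration scheme blends the current value estimate with the Bellman backup through a relaxation parameter. A minimal sketch of that first-order scheme only (the Newton-Raphson acceleration that is the paper's contribution is not shown; the tiny MDP encoding below is an assumption):

```python
import numpy as np

def relaxed_value_iteration(P, R, gamma, w=1.0, iters=500):
    # Successive relaxation value iteration:
    #   V <- (1 - w) * V + w * max_a [ R(s,a) + gamma * P(s,a,.) @ V ]
    # w = 1 recovers standard value iteration.
    # P has shape (S, A, S): transition probabilities.
    # R has shape (S, A): expected one-step rewards.
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)          # shape (S, A)
        V = (1.0 - w) * V + w * Q.max(axis=1)
    return V
```

A suitable choice of w can improve the contraction factor over plain value iteration; applying Newton-Raphson on top of this fixed-point scheme is what yields the second-order procedure.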
no code implementations • 9 Mar 2019 • Chandramouli Kamanchi, Raghuram Bharadwaj Diddigi, Shalabh Bhatnagar
We first derive a modified fixed point iteration for SOR Q-values and utilize stochastic approximation to derive a learning algorithm to compute the optimal value function and an optimal policy.
no code implementations • 11 Feb 2019 • Chandramouli Kamanchi, Raghuram Bharadwaj Diddigi, Prabuchandran K. J., Shalabh Bhatnagar
In many of the practical applications, the analytical form of the density is not known and only the samples from the distribution are available.
2 code implementations • 8 Nov 2018 • Abhik Singla, Sindhu Padakandla, Shalabh Bhatnagar
When compared to obstacle avoidance in ground vehicular robots, UAV navigation brings in additional challenges because the UAV motion is no longer constrained to a well-defined indoor ground or street environment.
no code implementations • 9 Oct 2018 • Abhik Singla, Shounak Bhattacharya, Dhaivat Dholakiya, Shalabh Bhatnagar, Ashitava Ghosal, Bharadwaj Amrutur, Shishir Kolathaya
Leveraging on this underlying structure, we then realize walking in Stoch by a straightforward reconstruction of joint trajectories from kMPs.
1 code implementation • 8 Aug 2018 • Prashanth L. A, Shalabh Bhatnagar, Nirav Bhavsar, Michael Fu, Steven I. Marcus
We introduce deterministic perturbation schemes for the recently proposed random directions stochastic approximation (RDSA) [17], and propose new first-order and second-order algorithms.
no code implementations • 15 Jun 2018 • Ajin George Joseph, Shalabh Bhatnagar
In this paper, we provide two new stable online algorithms for the problem of prediction in reinforcement learning, i.e., estimating the value function of a model-free Markov reward process using the linear function approximation architecture, with memory and computation costs scaling quadratically in the size of the feature set.
no code implementations • 22 Feb 2018 • Arunselvan Ramaswamy, Shalabh Bhatnagar, Daniel E. Quevedo
In this paper, we present verifiable sufficient conditions for stability and convergence of asynchronous SAs with biased approximation errors.
no code implementations • 31 Jan 2018 • Ajin George Joseph, Shalabh Bhatnagar
The cross entropy (CE) method is a model based search method to solve optimization problems where the objective function has minimal structure.
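A minimal sketch of the CE idea with a Gaussian sampling model: draw a population, keep the best fraction, refit the model to those elites, and repeat. Only function evaluations are used, matching the "minimal structure" setting (population sizes and the test function are illustrative assumptions, not from the paper):

```python
import numpy as np

def cross_entropy_min(f, dim, iters=50, pop=100, n_elite=10, seed=0):
    # Cross-entropy method for minimization with an independent
    # Gaussian sampling model.  Each round: sample pop candidates,
    # keep the n_elite with the lowest f-values, and refit the
    # Gaussian's mean/std to those elites.
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), 2.0 * np.ones(dim)
    for _ in range(iters):
        X = rng.normal(mu, sigma, size=(pop, dim))
        elite = X[np.argsort([f(x) for x in X])[:n_elite]]
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-6   # floor keeps sampling alive
    return mu
```

Because the sampling distribution concentrates around the elites, the method needs no gradients or smoothness from f, only the ability to evaluate it.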
no code implementations • 31 Jan 2018 • Ajin George Joseph, Shalabh Bhatnagar
In this paper, we consider a modified version of the control problem in a model free Markov decision process (MDP) setting with large state and action spaces.
no code implementations • 14 Nov 2017 • Diddigi Raghuram Bharadwaj, Sai Koti Reddy Danda, Krishnasuri Narayanam, Shalabh Bhatnagar
This paper considers two important problems, one on the supply side and one on the demand side, and studies both in a unified framework.
no code implementations • 14 Sep 2017 • Arunselvan Ramaswamy, Shalabh Bhatnagar
In this paper, we consider the stochastic iterative counterpart of the value iteration scheme wherein only noisy and possibly biased approximations of the Bellman operator are available.
no code implementations • 27 Aug 2017 • Raghuram Bharadwaj Diddigi, Prabuchandran K. J., Shalabh Bhatnagar
We consider the problem of tracking an intruder using a network of wireless sensors.
no code implementations • 25 Aug 2017 • Raghuram Bharadwaj Diddigi, D. Sai Koti Reddy, Shalabh Bhatnagar
Finally, we also consider a variant of this problem where the cost of power production at the main site is taken into consideration.
no code implementations • 22 Dec 2016 • Prasenjit Karmakar, Shalabh Bhatnagar
The novelty of our approach is that we use the irreducibility of the Markov chain to obtain the new bounds, whereas the earlier work by Basu et al. used a spectral variation bound that holds for any matrix.
no code implementations • 30 Nov 2016 • Sandeep Kumar, Sindhu Padakandla, Chandrashekar L, Priyank Parihar, K Gopinath, Shalabh Bhatnagar
Our method, when tested on a 25-node Hadoop cluster, shows a 66% decrease in the execution time of Hadoop jobs on average, compared to the default configuration.
no code implementations • 19 May 2016 • Prasenjit Karmakar, Rajkumar Maity, Shalabh Bhatnagar
In this paper we provide a rigorous convergence analysis of an "off"-policy temporal difference learning algorithm with linear function approximation and per-time-step linear computational complexity in an "online" learning environment.
no code implementations • 1 Apr 2016 • Arunselvan Ramaswamy, Shalabh Bhatnagar
The main aim of this paper is to provide an analysis of gradient descent (GD) algorithms with gradient errors that do not necessarily vanish asymptotically.
no code implementations • 27 Nov 2015 • Chandrashekar Lakshmi Narayanan, Raj Kumar Maity, Shalabh Bhatnagar
In this paper, we combine task-dependent reward shaping and task-independent proto-value functions to obtain reward dependent proto-value functions (RPVFs).
no code implementations • 28 Jul 2015 • Prashanth L. A., H. L. Prasad, Shalabh Bhatnagar, Prakash Chandra
We propose a novel actor-critic algorithm with guaranteed convergence to an optimal policy for a discounted reward Markov decision process.
no code implementations • 1 Jul 2015 • H. L. Prasad, Shalabh Bhatnagar
However, the optimization problem there has a non-linear objective and non-linear constraints with special structure.
no code implementations • 23 Apr 2015 • Arunselvan Ramaswamy, Shalabh Bhatnagar
Analyzing this class of algorithms is important, since many reinforcement learning (RL) algorithms can be cast as SAs driven by a `controlled Markov' process.
no code implementations • 31 Mar 2015 • Prasenjit Karmakar, Shalabh Bhatnagar
We present for the first time an asymptotic convergence analysis of two time-scale stochastic approximation driven by `controlled' Markov noise.
no code implementations • 17 Mar 2015 • Sindhu Padakandla, Prabuchandran K. J, Shalabh Bhatnagar
We also develop a cross entropy based method that incorporates policy parameterization in order to find near optimal energy sharing policies.
1 code implementation • 19 Feb 2015 • Prashanth L. A., Shalabh Bhatnagar, Michael Fu, Steve Marcus
We prove the unbiasedness of both gradient and Hessian estimates and asymptotic (strong) convergence for both first-order and second-order schemes.
no code implementations • 6 Feb 2015 • Arunselvan Ramaswamy, Shalabh Bhatnagar
In this paper the stability theorem of Borkar and Meyn is extended to include the case when the mean field is a differential inclusion.
no code implementations • 6 Feb 2015 • Arunselvan Ramaswamy, Shalabh Bhatnagar
In this paper we present a framework to analyze the asymptotic behavior of two timescale stochastic approximation algorithms including those with set-valued mean fields.
no code implementations • NeurIPS 2014 • Hengshuai Yao, Csaba Szepesvari, Richard S. Sutton, Joseph Modayil, Shalabh Bhatnagar
We prove that the UOM of an option can construct a traditional option model given a reward function, and the option-conditional return is computed directly by a single dot-product of the UOM with the reward function.
no code implementations • 8 Jan 2014 • H. L. Prasad, L. A. Prashanth, Shalabh Bhatnagar
We then provide a characterization of solution points of these sub-problems that correspond to Nash equilibria of the underlying game and for this purpose, we derive a set of necessary and sufficient SG-SP (Stochastic Game - Sub-Problem) conditions.
no code implementations • 27 Dec 2013 • Prashanth L. A., Abhranil Chatterjee, Shalabh Bhatnagar
For each criterion, we propose a convergent on-policy Q-learning algorithm that operates on two timescales, while employing function approximation to handle the curse of dimensionality associated with the underlying POMDP.
no code implementations • 21 Jun 2012 • Debarghya Ghoshdastidar, Ambedkar Dukkipati, Shalabh Bhatnagar
This motivates us to study SF schemes for gradient estimation using the q-Gaussian distribution.
no code implementations • NeurIPS 2009 • Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks.
no code implementations • NeurIPS 2009 • Hengshuai Yao, Shalabh Bhatnagar, Dongcui Diao, Richard S. Sutton, Csaba Szepesvári
We extend the Dyna planning architecture for policy evaluation and control in two significant respects.