1 code implementation • 31 Mar 2024 • Mohamed Elsayed, A. Rupam Mahmood
Deep representation learning methods struggle with continual learning, suffering from both catastrophic forgetting of useful units and loss of plasticity, often due to rigid and unhelpful units.
1 code implementation • 23 Dec 2023 • Bram Grooten, Tristan Tomilin, Gautham Vasan, Matthew E. Taylor, A. Rupam Mahmood, Meng Fang, Mykola Pechenizkiy, Decebal Constantin Mocanu
Our algorithm improves the agent's focus with useful masks, while its efficient Masker network only adds 0.2% more parameters to the original structure, in contrast to previous work.
no code implementations • 2 Oct 2023 • Qingfeng Lan, A. Rupam Mahmood
We show that by simply replacing classical activation functions with elephant activation functions, we can significantly improve the resilience of neural networks to catastrophic forgetting.
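Swapping an activation function is a one-line change in most frameworks. As an illustration only (the exact elephant functions are defined in the paper; the bump-shaped function below is a hypothetical stand-in with the same local-response character), a locally supported unit responds only near zero, so a weight update affects few inputs:

```python
import numpy as np

def gaussian_bump(x, width=1.0):
    # Locally supported activation (a stand-in, NOT the paper's elephant
    # function): the unit responds only near x = 0, so updates stay local.
    return np.exp(-(x / width) ** 2)

def relu(x):
    # Classical activation: output (and gradient reach) grows without bound.
    return np.maximum(x, 0.0)

x = np.array([-3.0, 0.0, 3.0])
print(gaussian_bump(x))  # only the center input responds strongly
print(relu(x))
```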
1 code implementation • 23 Jun 2023 • Shibhansh Dohare, J. Fernando Hernandez-Garcia, Parash Rahman, A. Rupam Mahmood, Richard S. Sutton
If deep-learning systems are applied in a continual learning setting, then it is well known that they may fail to remember earlier examples.
no code implementations • 23 Jun 2023 • Fengdi Che, Gautham Vasan, A. Rupam Mahmood
The policy gradient theorem gives a convenient form of the policy gradient in terms of three factors: an action value, a gradient of the action likelihood, and a state distribution involving discounting called the \emph{discounted stationary distribution}.
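In standard notation (assumed here, not quoted from the paper), the theorem reads:

```latex
\nabla_\theta J(\theta) \;\propto\; \sum_{s} d^{\pi}_{\gamma}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, q^{\pi}(s, a),
\qquad
d^{\pi}_{\gamma}(s) \;=\; \sum_{t=0}^{\infty} \gamma^{t} \Pr(S_t = s \mid \pi),
```

where $q^{\pi}$ is the action-value function, $\pi_\theta(a \mid s)$ the action likelihood, and $d^{\pi}_{\gamma}$ the discounted stationary distribution.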
1 code implementation • 29 May 2023 • Haque Ishfaq, Qingfeng Lan, Pan Xu, A. Rupam Mahmood, Doina Precup, Anima Anandkumar, Kamyar Azizzadenesheli
One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings.
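In simple settings the posterior can be maintained exactly, which is what deep-RL settings with Gaussian approximations give up. A minimal exact-posterior Thompson sampler for a two-armed Bernoulli bandit (arm means chosen here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.3, 0.7]                 # illustrative Bernoulli arm means
alpha = np.ones(2); beta = np.ones(2)   # Beta(1, 1) priors on each arm

for _ in range(2000):
    # Thompson sampling: draw one sample from each arm's exact Beta
    # posterior and act greedily with respect to the samples.
    samples = rng.beta(alpha, beta)
    a = int(np.argmax(samples))
    r = rng.random() < true_means[a]    # Bernoulli reward
    alpha[a] += r; beta[a] += 1 - r     # exact conjugate posterior update

pulls = alpha + beta - 2                # pull counts recovered from posteriors
print(pulls)                            # the better arm dominates
```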
1 code implementation • 9 May 2023 • Homayoon Farrahi, A. Rupam Mahmood
In this work, we investigate the widely used baseline hyper-parameter values of two policy gradient algorithms -- PPO and SAC -- across different cycle times.
no code implementations • 7 Feb 2023 • Mohamed Elsayed, A. Rupam Mahmood
Modern representation learning methods often struggle to adapt quickly under non-stationarity because they suffer from catastrophic forgetting and decaying plasticity.
1 code implementation • 3 Feb 2023 • Qingfeng Lan, A. Rupam Mahmood, Shuicheng Yan, Zhongwen Xu
Reinforcement learning (RL) is fundamentally different from supervised learning, and in practice these learned optimizers do not work well even in simple RL tasks.
1 code implementation • 6 Dec 2022 • Amirmohammad Karimi, Jun Jin, Jun Luo, A. Rupam Mahmood, Martin Jagersand, Samuele Tosatto
In classic reinforcement learning algorithms, agents make decisions at discrete and fixed time intervals.
1 code implementation • 20 Oct 2022 • Mohamed Elsayed, A. Rupam Mahmood
Second-order optimization uses curvature information about the objective function, which can help in faster convergence.
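A minimal sketch of why curvature helps: on an ill-conditioned quadratic, one Newton step (which uses the Hessian) lands on the minimum, while gradient descent is throttled by the largest curvature direction:

```python
import numpy as np

# Quadratic objective f(w) = 0.5 * w^T A w with ill-conditioned curvature.
A = np.diag([1.0, 100.0])
w0 = np.array([1.0, 1.0])
grad = A @ w0

# One Newton step uses the curvature (Hessian) and lands on the minimum.
w_newton = w0 - np.linalg.solve(A, grad)

# A gradient step with a stable step size barely moves the flat direction.
lr = 1.0 / 100.0            # step size limited by the largest curvature
w_gd = w0 - lr * grad

print(w_newton)             # [0. 0.] -- converged in one step
print(w_gd)                 # flat direction has barely moved
```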
2 code implementations • 5 Oct 2022 • Yan Wang, Gautham Vasan, A. Rupam Mahmood
A common setup for a robotic agent is to use two computers simultaneously: a resource-limited local computer tethered to the robot and a powerful remote computer connected wirelessly.
1 code implementation • 22 May 2022 • Qingfeng Lan, Yangchen Pan, Jun Luo, A. Rupam Mahmood
The experience replay buffer, a standard component in deep reinforcement learning, is often used to reduce forgetting and improve sample efficiency by storing experiences in a large buffer and using them for training later.
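A minimal sketch of the standard component being described (uniform sampling from a bounded buffer; not the paper's proposed variant):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store transitions in a bounded buffer
    and sample uniformly for later training updates."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest items are evicted

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.add((t, t + 1))        # placeholder (state, next_state) pairs
print(len(buf.buffer))         # prints 3: capacity bounds the buffer
```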
1 code implementation • 23 Mar 2022 • Yufeng Yuan, A. Rupam Mahmood
An oft-ignored challenge of real-world reinforcement learning is that the real world does not pause when agents make learning updates.
1 code implementation • 4 Feb 2022 • Samuele Tosatto, Andrew Patterson, Martha White, A. Rupam Mahmood
The policy gradient theorem (Sutton et al., 2000) prescribes the usage of a cumulative discounted state distribution under the target policy to approximate the gradient.
1 code implementation • 22 Dec 2021 • Shivam Garg, Samuele Tosatto, Yangchen Pan, Martha White, A. Rupam Mahmood
Policy gradient (PG) estimators are ineffective in dealing with softmax policies that are sub-optimally saturated, which refers to the situation when the policy concentrates its probability mass on sub-optimal actions.
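The ineffectiveness can be seen numerically: for a softmax policy, the exact gradient of the expected reward with respect to a logit is $\pi_i(r_i - \mathbb{E}[r])$, which vanishes when the policy saturates on any action, optimal or not (a small illustration, not the paper's estimator):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift for numerical stability
    return e / e.sum()

def pg_wrt_logits(z, r):
    # Exact gradient of E_{a ~ softmax(z)}[r(a)] w.r.t. the logits:
    # pi_i * (r_i - E[r]).
    pi = softmax(z)
    return pi * (r - pi @ r)

r = np.array([1.0, 0.0])               # action 0 is optimal
z_saturated = np.array([-10.0, 10.0])  # mass concentrated on action 1
z_uniform = np.array([0.0, 0.0])

print(np.abs(pg_wrt_logits(z_saturated, r)).max())  # vanishingly small
print(np.abs(pg_wrt_logits(z_uniform, r)).max())    # healthy signal
```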
1 code implementation • 13 Aug 2021 • Shibhansh Dohare, Richard S. Sutton, A. Rupam Mahmood
The Backprop algorithm for learning in neural networks relies on two mechanisms: stochastic gradient descent and initialization with small random weights, where the latter is essential to the effectiveness of the former.
no code implementations • 17 Jul 2021 • Alan Chan, Hugo Silva, Sungsu Lim, Tadashi Kozuno, A. Rupam Mahmood, Martha White
Approximate Policy Iteration (API) algorithms alternate between (approximate) policy evaluation and (approximate) greedification.
1 code implementation • 9 Mar 2021 • Qingfeng Lan, Samuele Tosatto, Homayoon Farrahi, A. Rupam Mahmood
As a key component in reinforcement learning, the reward function is usually devised carefully to guide the agent.
1 code implementation • 27 Mar 2019 • Dmytro Korenkevych, A. Rupam Mahmood, Gautham Vasan, James Bergstra
We introduce a family of stationary autoregressive (AR) stochastic processes to facilitate exploration in continuous control domains.
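An illustrative member of this family (not necessarily the paper's parameterization) is a stationary AR(1) process: scaling the innovation noise by $\sqrt{1-\varphi^2}$ keeps the marginal variance at 1 for any $|\varphi| < 1$, giving temporally correlated exploration noise whose scale does not drift:

```python
import numpy as np

def ar1_noise(n_steps, phi=0.9, rng=None):
    # Stationary AR(1): x_{t+1} = phi * x_t + sigma * eps_t, with sigma
    # chosen so the marginal variance is 1 for any |phi| < 1.
    if rng is None:
        rng = np.random.default_rng(0)
    sigma = np.sqrt(1.0 - phi ** 2)
    x = np.empty(n_steps)
    x[0] = rng.standard_normal()        # start from the stationary marginal
    for t in range(1, n_steps):
        x[t] = phi * x[t - 1] + sigma * rng.standard_normal()
    return x

noise = ar1_noise(50_000, phi=0.9)
print(round(noise.var(), 2))            # close to 1: stationarity preserved
```

Larger `phi` yields smoother, more persistent action sequences, which is what makes such processes attractive for exploration on physical robots.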
2 code implementations • 20 Sep 2018 • A. Rupam Mahmood, Dmytro Korenkevych, Gautham Vasan, William Ma, James Bergstra
The research community is now able to quickly reproduce, analyze, and build on these results due to open source implementations of learning algorithms and simulated benchmark tasks.
2 code implementations • 19 Mar 2018 • A. Rupam Mahmood, Dmytro Korenkevych, Brent J. Komer, James Bergstra
Reinforcement learning is a promising approach to developing hard-to-engineer adaptive solutions for complex and diverse robotic tasks.
no code implementations • 14 Apr 2017 • Huizhen Yu, A. Rupam Mahmood, Richard S. Sutton
As to its soundness, using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both the evolution of $\lambda$ and the unique invariant probability measure of the state-trace process.
1 code implementation • 13 Dec 2015 • Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Marlos C. Machado, Richard S. Sutton
Our results suggest that the true online methods indeed dominate the regular methods.
no code implementations • 6 Jul 2015 • A. Rupam Mahmood, Huizhen Yu, Martha White, Richard S. Sutton
Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps.
no code implementations • 1 Jul 2015 • Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Richard S. Sutton
Our results confirm the strength of true online TD($\lambda$): 1) for sparse feature vectors, the computational overhead with respect to TD($\lambda$) is minimal, and for non-sparse features the computation time is at most twice that of TD($\lambda$); 2) across all domains and representations, the learning speed of true online TD($\lambda$) is often better than, and never worse than, that of TD($\lambda$); and 3) true online TD($\lambda$) is easier to use, because it does not require choosing between trace types and is generally more stable with respect to the step size.
no code implementations • 14 Mar 2015 • Richard S. Sutton, A. Rupam Mahmood, Martha White
In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps.
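One common form of this emphasis weighting (following Sutton, Mahmood & White, 2016; notation assumed here) is:

```latex
F_t = \rho_{t-1}\,\gamma_t\,F_{t-1} + I_t, \qquad
M_t = \lambda_t I_t + (1 - \lambda_t)\,F_t,
```

where $I_t$ is the user-specified interest in time step $t$, $\rho_t$ is the importance-sampling ratio, and the TD update at step $t$ is scaled by the emphasis $M_t$ rather than applied uniformly.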
no code implementations • NeurIPS 2014 • A. Rupam Mahmood, Hado P. Van Hasselt, Richard S. Sutton
Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD(lambda).