no code implementations • 11 Jul 2023 • Sanae Amani, Khushbu Pahwa, Vladimir Braverman, Lin F. Yang
Our research demonstrates that to achieve $\epsilon$-optimal policies for all $M$ tasks, a single agent using DistMT-LSVI needs to run a total number of episodes that is at most $\tilde{\mathcal{O}}(d^3H^6(\epsilon^{-2}+c_{\rm sep}^{-2})\cdot M/N)$, where $c_{\rm sep}>0$ is a constant representing task separability, $H$ is the horizon of each episode, and $d$ is the feature dimension of the dynamics and rewards.
no code implementations • 1 Jun 2022 • Sanae Amani, Lin F. Yang, Ching-An Cheng
We study lifelong reinforcement learning (RL) in a regret minimization setting of linear contextual Markov decision process (MDP), where the agent needs to learn a multi-task policy while solving a streaming sequence of tasks.
no code implementations • 26 May 2022 • Sanae Amani, Tor Lattimore, András György, Lin F. Yang
In particular, for scenarios with known context distribution, the communication cost of DisBE-LUCB is only $\tilde{\mathcal{O}}(dN)$ and its regret is ${\tilde{\mathcal{O}}}(\sqrt{dNT})$, which is of the same order as that incurred by an optimal single-agent algorithm for $NT$ rounds.
no code implementations • 11 Jun 2021 • Sanae Amani, Christos Thrampoulidis, Lin F. Yang
Safety in reinforcement learning has become increasingly important in recent years.
no code implementations • NeurIPS 2021 • Sanae Amani, Christos Thrampoulidis
Out of the rich family of generalized linear bandits, perhaps the most well-studied are logistic bandits, which are used in problems with binary rewards: for instance, when the learner/agent tries to maximize profit from a user who can select one of two possible outcomes (e.g., `click' vs. `no-click').
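The binary-reward setting described above can be sketched as follows. This is a minimal illustration of the standard logistic bandit reward model, not the paper's algorithm: the names `theta_star`, `pull`, the dimension `d`, and the random features are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical instance: d-dimensional features, unknown parameter theta_star.
d = 5
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

def pull(x):
    """Binary reward: 1 ('click') with probability sigmoid(x^T theta_star)."""
    p = sigmoid(x @ theta_star)
    return rng.binomial(1, p)

# One example round: the agent plays an action and observes a 0/1 reward.
action = rng.normal(size=d)
reward = pull(action)
```

The learner never sees `theta_star`; it only observes the 0/1 rewards and must estimate the parameter from them, which is what makes the logistic (rather than linear) link the crux of the analysis.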
no code implementations • 1 Dec 2020 • Sanae Amani, Christos Thrampoulidis
For this problem, we propose DLUCB: a fully decentralized algorithm that minimizes the cumulative regret over the entire network.
no code implementations • L4DC 2020 • Sanae Amani, Mahnoosh Alizadeh, Christos Thrampoulidis
Many applications require a learner to make sequential decisions given uncertainty regarding both the system's payoff function and safety constraints.
no code implementations • 5 May 2020 • Sanae Amani, Mahnoosh Alizadeh, Christos Thrampoulidis
Many applications require a learner to make sequential decisions given uncertainty regarding both the system's payoff function and safety constraints.
no code implementations • 6 Nov 2019 • Ahmadreza Moradipari, Sanae Amani, Mahnoosh Alizadeh, Christos Thrampoulidis
We compare the performance of our algorithm with UCB-based safe algorithms and highlight how the inherently randomized nature of TS leads to superior performance in expanding the set of safe actions the algorithm has access to at each round.
no code implementations • NeurIPS 2019 • Sanae Amani, Mahnoosh Alizadeh, Christos Thrampoulidis
During the pure exploration phase the learner chooses her actions at random from a restricted set of safe actions with the goal of learning a good approximation of the entire unknown safe set.
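The pure exploration phase described above can be sketched as follows. This is a simplified illustration under assumed specifics, not the paper's procedure: the linear constraint $\langle a, \mu_\star\rangle \le \tau$, the seed set of known-safe actions, the noise level, and the regularizer `lam` are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical instance: actions a are safe iff a @ mu_star <= tau,
# with mu_star unknown to the learner.
d, tau, T0 = 3, 0.5, 200
mu_star = rng.normal(size=d)
mu_star /= np.linalg.norm(mu_star)

# Restricted set of known-safe actions: small-norm seed actions
# (assumed safe here because |a @ mu_star| <= ||a|| < tau).
seed_actions = 0.1 * rng.normal(size=(T0, d))

# Pure exploration: play seed actions at random, observe noisy
# constraint measurements.
z = seed_actions @ mu_star + 0.01 * rng.normal(size=T0)

# Regularized least-squares estimate of mu_star from exploration data.
lam = 0.01
V = seed_actions.T @ seed_actions + lam * np.eye(d)
mu_hat = np.linalg.solve(V, seed_actions.T @ z)

def looks_safe(a, beta=0.05):
    """Conservative check: estimated constraint value plus an
    uncertainty margin must stay below the threshold tau."""
    margin = beta * np.sqrt(a @ np.linalg.solve(V, a))
    return a @ mu_hat + margin <= tau
```

After enough random safe plays, `mu_hat` approximates `mu_star` well, so the conservative set `{a : looks_safe(a)}` approaches the true safe set while never certifying an unsafe action, which is the role the pure exploration phase plays before exploitation begins.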