1 code implementation • 3 Oct 2024 • Shreyas Chaudhari, Ameet Deshpande, Bruno Castro da Silva, Philip S. Thomas
Evaluating policies using off-policy data is crucial for applying reinforcement learning to real-world problems such as healthcare and autonomous driving.
no code implementations • 23 Jun 2024 • Scott M. Jordan, Adam White, Bruno Castro da Silva, Martha White, Philip S. Thomas
Novel reinforcement learning algorithms, or improvements on existing ones, are commonly justified by evaluating their performance on benchmark environments and comparing them against an ever-changing set of standard algorithms.
1 code implementation • 9 Jun 2024 • Kartik Choudhary, Dhawal Gupta, Philip S. Thomas
We present ICU-Sepsis, an environment that can be used in benchmarks for evaluating reinforcement learning (RL) algorithms.
no code implementations • 20 Dec 2023 • Dhawal Gupta, Scott M. Jordan, Shreyas Chaudhari, Bo Liu, Philip S. Thomas, Bruno Castro da Silva
In this paper, we introduce a fresh perspective on the challenges of credit assignment and policy evaluation.
1 code implementation • 23 Oct 2023 • Yuhong Luo, Austin Hoag, Philip S. Thomas
Representation learning is increasingly employed to generate representations that are predictive across multiple downstream tasks.
no code implementations • 16 May 2023 • James E. Kostas, Scott M. Jordan, Yash Chandak, Georgios Theocharous, Dhawal Gupta, Martha White, Bruno Castro da Silva, Philip S. Thomas
However, the coagent framework is not just an alternative to BDL; the two approaches can be blended: BDL can be combined with coagent learning rules to create architectures with the advantages of both approaches.
no code implementations • 6 Feb 2023 • Yash Chandak, Shiv Shankar, Venkata Gandikota, Philip S. Thomas, Arya Mazumdar
We propose a first-order method for convex optimization, where instead of being restricted to the gradient from a single parameter, gradients from multiple parameters can be used during each step of gradient descent.
1 code implementation • 24 Jan 2023 • Yash Chandak, Shiv Shankar, Nathaniel D. Bastian, Bruno Castro da Silva, Emma Brunskill, Philip S. Thomas
Methods for sequential decision-making are often built upon a foundational assumption that the underlying decision process is stationary.
1 code implementation • 7 Dec 2022 • David M. Bossens, Philip S. Thomas
In many domains, exploration in reinforcement learning is too costly because it requires trying out suboptimal policies. This creates a need for off-policy evaluation, in which a target policy is evaluated using data collected from a known behaviour policy.
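As context for entries like this one, below is a minimal sketch of the ordinary (trajectory-wise) importance sampling estimator that off-policy evaluation methods build on; the function and argument names are illustrative and not taken from the paper.

```python
import numpy as np

def ordinary_importance_sampling(trajectories, pi_e, pi_b, gamma=1.0):
    """Trajectory-wise importance sampling estimate of a target policy's expected return.

    trajectories: list of trajectories, each a list of (state, action, reward) tuples
                  collected by running the behaviour policy.
    pi_e(a, s):   probability of action a in state s under the target (evaluation) policy.
    pi_b(a, s):   probability of action a in state s under the behaviour policy.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(a, s) / pi_b(a, s)  # cumulative likelihood ratio
            ret += (gamma ** t) * r            # discounted return of this trajectory
        estimates.append(weight * ret)         # reweighted behaviour-policy return
    return float(np.mean(estimates))
```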
no code implementations • 24 Aug 2022 • Aline Weber, Blossom Metevier, Yuriy Brun, Philip S. Thomas, Bruno Castro da Silva
Recent research has shown that seemingly fair machine learning models, when used to inform decisions that have an impact on people's lives or well-being (e.g., applications involving education, employment, and lending), can inadvertently increase social inequality in the long term.
no code implementations • 6 Jun 2022 • Abhinav Bhatia, Philip S. Thomas, Shlomo Zilberstein
Model-based reinforcement learning promises to learn an optimal policy from fewer interactions with the environment than model-free reinforcement learning, by learning an intermediate model of the environment that can predict future interactions.
no code implementations • 10 Dec 2021 • James E. Kostas, Philip S. Thomas, Georgios Theocharous
In this work, we build on asynchronous coagent policy gradient algorithms \citep{kostas2020asynchronous} to propose a principled solution to this problem.
no code implementations • NeurIPS 2021 • Dhawal Gupta, Gabor Mihucz, Matthew Schlegel, James Kostas, Philip S. Thomas, Martha White
In this work, we revisit this approach and investigate if we can leverage other reinforcement learning approaches to improve learning.
1 code implementation • NeurIPS 2021 • Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, Scott Niekum
In this paper, we present a new perspective on this bias-variance trade-off and show the existence of a spectrum of estimators whose endpoints are SIS and IS.
no code implementations • ICLR 2022 • Stephen Giguere, Blossom Metevier, Yuriy Brun, Philip S. Thomas, Scott Niekum, Bruno Castro da Silva
Recent studies have demonstrated that using machine learning for social applications can lead to injustice in the form of racist, sexist, and otherwise unfair and discriminatory outcomes.
no code implementations • NeurIPS 2021 • Harsh Satija, Philip S. Thomas, Joelle Pineau, Romain Laroche
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning (RL) setting.
1 code implementation • NeurIPS 2021 • Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, Philip S. Thomas
When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy.
no code implementations • 25 Jan 2021 • Yash Chandak, Shiv Shankar, Philip S. Thomas
Many sequential decision-making systems leverage data collected using prior policies to propose a new policy.
no code implementations • NeurIPS 2020 • Pinar Ozisik, Philip S. Thomas
We analyze the extent to which existing methods rely on accurate training data for a specific class of reinforcement learning (RL) algorithms, known as Safe and Seldonian RL.
1 code implementation • NeurIPS 2020 • Yash Chandak, Scott M. Jordan, Georgios Theocharous, Martha White, Philip S. Thomas
Many real-world sequential decision-making problems involve critical systems with financial risks and human-life risks.
no code implementations • 15 Sep 2020 • Georgios Theocharous, Yash Chandak, Philip S. Thomas, Frits de Nijs
Strategic recommendations (SR) refer to the problem where an intelligent agent observes the sequential behaviors and activities of users and decides when and how to interact with them to optimize some long-term objectives, both for the user and the business.
1 code implementation • ICML 2020 • Scott M. Jordan, Yash Chandak, Daniel Cohen, Mengxue Zhang, Philip S. Thomas
Performance evaluations are critical for quantifying algorithmic advances in reinforcement learning.
1 code implementation • ICML 2020 • Yash Chandak, Georgios Theocharous, Shiv Shankar, Martha White, Sridhar Mahadevan, Philip S. Thomas
Most reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process is stationary.
no code implementations • 6 Jan 2020 • Francisco M. Garcia, Chris Nota, Philip S. Thomas
Reinforcement learning (RL) has become an increasingly active area of research in recent years.
1 code implementation • NeurIPS 2019 • Blossom Metevier, Stephen Giguere, Sarah Brockman, Ari Kobren, Yuriy Brun, Emma Brunskill, Philip S. Thomas
We present RobinHood, an offline contextual bandit algorithm designed to satisfy a broad family of fairness constraints.
1 code implementation • NeurIPS Workshop Neuro_AI 2019 • Sneha Aenugu, Abhishek Sharma, Sasikiran Yelamarthi, Hananel Hazan, Philip S. Thomas, Robert Kozma
Neuroscientific theory suggests that dopaminergic neurons broadcast global reward prediction errors to large areas of the brain influencing the synaptic plasticity of the neurons in those regions.
no code implementations • 17 Jun 2019 • Chris Nota, Philip S. Thomas
The policy gradient theorem describes the gradient of the expected discounted return with respect to an agent's policy parameters.
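For reference, the classical (discounted) statement of the policy gradient theorem can be written as
\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\, Q^{\pi_\theta}(S_t, A_t)\right],
\]
where $J(\theta)$ is the expected discounted return and $Q^{\pi_\theta}$ is the action-value function of the current policy.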
no code implementations • 6 Jun 2019 • Philip S. Thomas, Scott M. Jordan, Yash Chandak, Chris Nota, James Kostas
We propose a new objective function for finite-horizon episodic Markov decision processes that better captures Bellman's principle of optimality, and provide an expression for the gradient of the objective.
1 code implementation • 5 Jun 2019 • Yash Chandak, Georgios Theocharous, Blossom Metevier, Philip S. Thomas
The Markov decision process (MDP) formulation used to model many real-world sequential decision making problems does not efficiently capture the setting where the set of available decisions (actions) at each time step is stochastic.
1 code implementation • 5 Jun 2019 • Yash Chandak, Georgios Theocharous, Chris Nota, Philip S. Thomas
While related problems have been well-studied in the lifelong learning literature, the setting where the action set changes remains unaddressed.
no code implementations • 15 May 2019 • Erik Learned-Miller, Philip S. Thomas
We present a new method for constructing a confidence interval for the mean of a bounded random variable from samples of the random variable.
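For comparison, the standard Hoeffding interval that such methods are typically measured against (not the method proposed in the paper) states that for i.i.d. samples $X_1,\dots,X_n$ of a random variable bounded in $[a,b]$, with probability at least $1-\delta$,
\[
\left|\,\bar{X}_n - \mu\,\right| \;\le\; (b-a)\sqrt{\frac{\ln(2/\delta)}{2n}},
\]
where $\bar{X}_n$ is the sample mean and $\mu$ the true mean.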
no code implementations • ICML 2020 • James E. Kostas, Chris Nota, Philip S. Thomas
Coagent policy gradient algorithms (CPGAs) are reinforcement learning algorithms for training a class of stochastic neural networks called coagent networks.
1 code implementation • NeurIPS 2019 • Francisco M. Garcia, Philip S. Thomas
In this paper we consider the problem of how a reinforcement learning agent that is tasked with solving a sequence of reinforcement learning problems (a sequence of Markov decision processes) can use knowledge acquired early in its lifetime to improve its ability to solve new problems.
no code implementations • 1 Feb 2019 • Tengyang Xie, Philip S. Thomas, Gerome Miklau
Many reinforcement learning applications involve the use of data that is sensitive, such as medical records of patients or financial information.
no code implementations • 1 Feb 2019 • Yash Chandak, Georgios Theocharous, James Kostas, Scott Jordan, Philip S. Thomas
Most model-free reinforcement learning methods leverage state representations (embeddings) for generalization, but either ignore structure in the space of actions or assume the structure is provided a priori.
no code implementations • 4 Dec 2018 • Saket Tiwari, Philip S. Thomas
In this paper we show how the option-critic architecture can be extended to estimate the natural gradient of the expected discounted return.
no code implementations • 24 Nov 2017 • Francisco M. Garcia, Bruno C. da Silva, Philip S. Thomas
In this paper we consider the problem of how a reinforcement learning agent tasked with solving a set of related Markov decision processes can use knowledge acquired early in its lifetime to improve its ability to more rapidly solve novel, but related, tasks.
no code implementations • 17 Aug 2017 • Philip S. Thomas, Bruno Castro da Silva, Andrew G. Barto, Emma Brunskill
We propose a new framework for designing machine learning algorithms that simplifies the problem of specifying and regulating undesirable behaviors.
no code implementations • 2017 • Philip S. Thomas, Emma Brunskill
We show how an action-dependent baseline can be used with the policy gradient theorem with function approximation, which was originally presented with action-independent baselines by Sutton et al. (2000).
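For context, the baseline-subtracted form of the policy gradient with a state-dependent baseline (the setting of Sutton et al. (2000); the paper concerns when the baseline may also depend on the action) is
\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\,\big(Q^{\pi_\theta}(S_t, A_t) - b(S_t)\big)\right],
\]
which remains unbiased because $\mathbb{E}\big[\nabla_\theta \log \pi_\theta(A_t \mid S_t)\, b(S_t) \,\big|\, S_t\big] = 0$ for any baseline that does not depend on $A_t$.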
1 code implementation • ICML 2017 • Josiah P. Hanna, Philip S. Thomas, Peter Stone, Scott Niekum
The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance.
no code implementations • 9 Jun 2017 • Philip S. Thomas, Christoph Dann, Emma Brunskill
When creating an artificial intelligence system, we must make two decisions: what representation should be used (i.e., what parameterized function should be used) and what learning rule should be used to search through the resulting set of representable functions.
no code implementations • NeurIPS 2017 • Zhaohan Daniel Guo, Philip S. Thomas, Emma Brunskill
In addition, we can take advantage of special cases that arise due to options-based policies to further improve the performance of importance sampling.
no code implementations • 10 Nov 2016 • Philip S. Thomas, Emma Brunskill
Importance sampling is often used in machine learning when training and testing data come from different distributions.
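The underlying identity is the standard importance sampling change of measure,
\[
\mathbb{E}_{x \sim p}[f(x)] \;=\; \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)}\, f(x)\right],
\]
which holds whenever $q(x) > 0$ for every $x$ with $p(x) f(x) \ne 0$ (a support condition that matters when the two distributions differ).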
3 code implementations • 4 Apr 2016 • Philip S. Thomas, Emma Brunskill
In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy.
1 code implementation • 30 Dec 2015 • Philip S. Thomas, Billy Okal
This paper specifies a notation for Markov decision processes.
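A sketch of one common convention for such notation (not necessarily the exact symbols fixed by the paper): an MDP is a tuple
\[
(\mathcal{S}, \mathcal{A}, p, R, d_0, \gamma),
\]
where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $p(s' \mid s, a)$ the transition function, $R$ the reward function, $d_0$ the initial state distribution, and $\gamma \in [0,1]$ the discount parameter.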
2 code implementations • 15 Dec 2015 • Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas, Rémi Munos
Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators which includes our consistent Bellman operator.
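For context, the standard Bellman optimality operator is $(\mathcal{T}Q)(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}[\max_{a'} Q(s',a')]$; one commonly cited form of the consistent Bellman operator modifies it only on self-transitions,
\[
(\mathcal{T}_C Q)(s,a) \;=\; r(s,a) + \gamma\,\mathbb{E}_{s'}\!\big[\,\mathbf{1}[s' \ne s]\,\max_{a'} Q(s',a') + \mathbf{1}[s' = s]\,Q(s,a)\,\big],
\]
which preserves the optimal policy while increasing the action gap.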
no code implementations • NeurIPS 2015 • Philip S. Thomas, Scott Niekum, Georgios Theocharous, George Konidaris
The benefit of the Ω-return is that it accounts for the correlation of different length returns.
no code implementations • NeurIPS 2013 • Philip S. Thomas, William C. Dabney, Stephen Giguere, Sridhar Mahadevan
Natural actor-critics are a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes.
no code implementations • NeurIPS 2011 • George Konidaris, Scott Niekum, Philip S. Thomas
We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms.
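For reference, the λ-return is the exponentially weighted mixture of n-step returns,
\[
G_t^{\lambda} \;=\; (1-\lambda)\sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_t^{(n)},
\]
where $G_t^{(n)}$ denotes the n-step return; the γ-return described above replaces these fixed exponential weights with weights derived from a model of each return's variance.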
no code implementations • NeurIPS 2011 • Philip S. Thomas
We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules.