no code implementations • ICLR 2019 • Honghua Dong, Jiayuan Mao, Xinyue Cui, Lihong Li
In this paper, we advocate the use of explicit memory for efficient exploration in reinforcement learning.
no code implementations • 29 Dec 2022 • Riashat Islam, Samarth Sinha, Homanga Bharadhwaj, Samin Yeasar Arnob, Zhuoran Yang, Animesh Garg, Zhaoran Wang, Lihong Li, Doina Precup
Learning policies from fixed offline datasets is a key challenge to scale up reinforcement learning (RL) algorithms towards practical applications.
no code implementations • 14 Oct 2022 • Ziyang Tang, Yiheng Duan, Stephanie Zhang, Lihong Li
Randomized experiments (a.k.a.
no code implementations • ICLR 2022 • Xiaoyu Chen, Jiachen Hu, Chi Jin, Lihong Li, LiWei Wang
Reinforcement learning encounters many challenges when applied directly in the real world.
no code implementations • 1 Jul 2021 • Yi Liu, Lihong Li
The rich body of bandit literature not only offers a diverse toolbox of algorithms, but also makes it hard for a practitioner to find the right one for the problem at hand.
no code implementations • 6 Apr 2021 • Chenjun Xiao, Yifan Wu, Tor Lattimore, Bo Dai, Jincheng Mei, Lihong Li, Csaba Szepesvari, Dale Schuurmans
First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis.
no code implementations • 8 Feb 2021 • Jiachen Hu, Xiaoyu Chen, Chi Jin, Lihong Li, LiWei Wang
This paper studies representation learning for multi-task linear bandits and multi-task episodic RL with linear value function approximation.
no code implementations • 1 Jan 2021 • Riashat Islam, Samarth Sinha, Homanga Bharadhwaj, Samin Yeasar Arnob, Zhuoran Yang, Zhaoran Wang, Animesh Garg, Lihong Li, Doina Precup
Learning policies from fixed offline datasets is a key challenge to scale up reinforcement learning (RL) algorithms towards practical applications.
no code implementations • NeurIPS 2020 • Jincheng Mei, Chenjun Xiao, Bo Dai, Lihong Li, Csaba Szepesvari, Dale Schuurmans
Both findings are based on an analysis of convergence rates using the Non-uniform Łojasiewicz (NŁ) inequalities.
no code implementations • NeurIPS 2020 • Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvári, Dale Schuurmans
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies.
3 code implementations • ICLR 2021 • Weitong Zhang, Dongruo Zhou, Lihong Li, Quanquan Gu
Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems.
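As a point of reference, here is a minimal sketch of the Thompson Sampling idea in the simpler linear-Gaussian setting (not the neural variant this work proposes); the class name and hyperparameters are illustrative assumptions.

```python
# Illustrative sketch: Thompson Sampling for a linear contextual bandit with a
# Gaussian posterior over the weight vector (not the paper's neural algorithm).
import numpy as np

class LinearTS:
    def __init__(self, dim, noise_var=1.0, prior_var=1.0):
        self.A = np.eye(dim) / prior_var   # posterior precision matrix
        self.b = np.zeros(dim)             # accumulated x * r / noise_var
        self.noise_var = noise_var

    def select(self, contexts):
        """contexts: (num_arms, dim) array of arm features; returns an arm index."""
        cov = np.linalg.inv(self.A)
        mean = cov @ self.b
        theta = np.random.multivariate_normal(mean, cov)  # one posterior sample
        return int(np.argmax(contexts @ theta))

    def update(self, x, reward):
        self.A += np.outer(x, x) / self.noise_var
        self.b += reward * x / self.noise_var
```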
no code implementations • ICLR 2021 • Xiaoyu Chen, Jiachen Hu, Lihong Li, Li-Wei Wang
The regret of FMDP-BF is shown to be exponentially smaller than that of optimal algorithms designed for non-factored MDPs, and improves on the best previous result for FMDPs (Osband et al., 2014) by a factor of $\sqrt{H|\mathcal{S}_i|}$, where $|\mathcal{S}_i|$ is the cardinality of the factored state subspace and $H$ is the planning horizon.
no code implementations • 27 Jul 2020 • Andrew Bennett, Nathan Kallus, Lihong Li, Ali Mousavi
We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders, where states and actions can act as proxies for the unobserved confounders.
no code implementations • NeurIPS 2020 • Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, Dale Schuurmans
The recently proposed distribution correction estimation (DICE) family of estimators has advanced the state of the art in off-policy evaluation from behavior-agnostic data.
no code implementations • ICLR 2020 • Ali Mousavi, Lihong Li, Qiang Liu, Denny Zhou
Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics, where high-fidelity simulators may not be available and on-policy evaluation is expensive or impossible.
1 code implementation • ICML 2020 • Junfeng Wen, Bo Dai, Lihong Li, Dale Schuurmans
We consider the problem of approximating the stationary distribution of an ergodic Markov chain given a set of sampled transitions.
1 code implementation • ICLR 2020 • Ruiyi Zhang, Bo Dai, Lihong Li, Dale Schuurmans
An important problem that arises in reinforcement learning and Monte Carlo methods is estimating quantities defined by the stationary distribution of a Markov chain.
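To make the target quantity concrete, here is a small sketch of a stationary distribution computed by power iteration on a known transition matrix; the paper's estimators work from sampled transitions without access to the matrix, and the numbers below are purely illustrative.

```python
# Minimal illustration of the quantity being estimated: the stationary
# distribution of a Markov chain and an expectation taken under it.
import numpy as np

def stationary_distribution(P, num_iters=1000):
    """Power iteration: repeatedly apply d <- d P until convergence."""
    n = P.shape[0]
    d = np.full(n, 1.0 / n)
    for _ in range(num_iters):
        d = d @ P
    return d

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
reward = np.array([0.0, 1.0])
d = stationary_distribution(P)
print(d, d @ reward)   # long-run state frequencies and average reward
```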
no code implementations • 12 Feb 2020 • Ge Liu, Rui Wu, Heng-Tze Cheng, Jing Wang, Jayden Ooi, Lihong Li, Ang Li, Wai Lok Sibon Li, Craig Boutilier, Ed Chi
Deep Reinforcement Learning (RL) has proven powerful for decision making in simulated environments.
no code implementations • 4 Dec 2019 • Ofir Nachum, Bo Dai, Ilya Kostrikov, Yin-Lam Chow, Lihong Li, Dale Schuurmans
In many real-world applications of reinforcement learning (RL), interactions with the environment are limited due to cost or feasibility.
4 code implementations • ICML 2020 • Dongruo Zhou, Lihong Li, Quanquan Gu
To the best of our knowledge, it is the first neural network-based contextual bandit algorithm with a near-optimal regret guarantee.
no code implementations • ICLR 2020 • Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, Qiang Liu
Our method is doubly robust in that the bias vanishes when either the density ratio or the value function estimation is perfect.
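A hedged sketch of the doubly robust structure in its simplest one-step (contextual-bandit) form, not the paper's infinite-horizon estimator; the data layout and the q_hat interface are illustrative assumptions.

```python
# Doubly robust estimate: a model-based term plus an importance-weighted
# residual, so the bias vanishes if either component is accurate.
import numpy as np

def doubly_robust_estimate(logged, q_hat, target_policy):
    """
    logged: iterable of (x, a, r, mu) tuples, mu = behavior prob. of a given x
    q_hat(x, a): estimated reward model
    target_policy(x): array of target-policy probabilities over all actions
    """
    vals = []
    for x, a, r, mu in logged:
        pi = target_policy(x)
        direct = sum(pi[b] * q_hat(x, b) for b in range(len(pi)))  # model term
        correction = (pi[a] / mu) * (r - q_hat(x, a))              # IS-weighted residual
        vals.append(direct + correction)
    return float(np.mean(vals))
```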
no code implementations • 25 Sep 2019 • Dongruo Zhou, Lihong Li, Quanquan Gu
To the best of our knowledge, our algorithm is the first neural network-based contextual bandit algorithm with near-optimal regret guarantee.
no code implementations • 21 Jun 2019 • Branislav Kveton, Manzil Zaheer, Csaba Szepesvari, Lihong Li, Mohammad Ghavamzadeh, Craig Boutilier
GLM-TSL samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution.
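A hedged sketch of the Laplace-approximation idea for a logistic-reward bandit: fit the MAP weight vector, approximate the posterior by a Gaussian centred there with covariance given by the inverse Hessian, and act greedily on a sample from that Gaussian. Function names, the prior, and the Newton-step count are illustrative, not the paper's exact specification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_posterior(X, y, prior_var=1.0, newton_steps=25):
    """X: (n, d) features of previously pulled arms; y: binary rewards in {0, 1}."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(newton_steps):              # Newton's method for the MAP estimate
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) + theta / prior_var
        W = (p * (1.0 - p))[:, None]
        H = X.T @ (W * X) + np.eye(d) / prior_var
        theta -= np.linalg.solve(H, grad)
    p = sigmoid(X @ theta)
    W = (p * (1.0 - p))[:, None]
    H = X.T @ (W * X) + np.eye(d) / prior_var
    return theta, np.linalg.inv(H)             # Gaussian mean and covariance

def glm_ts_select(arm_features, X, y):
    mean, cov = laplace_posterior(X, y)
    theta = np.random.multivariate_normal(mean, cov)  # one posterior sample
    return int(np.argmax(arm_features @ theta))
```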
2 code implementations • NeurIPS 2019 • Ofir Nachum, Yin-Lam Chow, Bo Dai, Lihong Li
In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset.
1 code implementation • NeurIPS 2019 • Yihao Feng, Lihong Li, Qiang Liu
Value function learning plays a central role in many state-of-the-art reinforcement-learning algorithms.
2 code implementations • ICLR 2019 • Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, Denny Zhou
We propose the Neural Logic Machine (NLM), a neural-symbolic architecture for both inductive learning and logic reasoning.
no code implementations • 7 Nov 2018 • Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill
The performance of a reinforcement learning algorithm can vary drastically during learning because of exploration.
2 code implementations • NeurIPS 2018 • Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou
We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy.
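For context, a minimal sketch of the standard trajectory-level importance-sampling baseline this line of work improves on: each logged trajectory is reweighted by the product of per-step policy ratios. The policy interfaces and discount factor are illustrative assumptions.

```python
import numpy as np

def importance_sampling_estimate(trajectories, target_policy, behavior_policy,
                                 gamma=0.99):
    """trajectories: list of lists of (state, action, reward) tuples.
    target_policy(s, a) and behavior_policy(s, a) return action probabilities."""
    estimates = []
    for traj in trajectories:
        ratio, value = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            ratio *= target_policy(s, a) / behavior_policy(s, a)  # per-step ratio
            value += (gamma ** t) * r
        estimates.append(ratio * value)   # weight the whole return by the product
    return float(np.mean(estimates))
```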
no code implementations • NeurIPS 2018 • Kwang-Sung Jun, Lihong Li, Yuzhe Ma, Xiaojin Zhu
We study adversarial attacks that manipulate the reward signals to control the actions chosen by a stochastic multi-armed bandit algorithm.
no code implementations • ACL 2018 • Jianfeng Gao, Michel Galley, Lihong Li
The present paper surveys neural approaches to conversational AI that have been developed in the last few years.
no code implementations • 17 Aug 2018 • Yuzhe Ma, Kwang-Sung Jun, Lihong Li, Xiaojin Zhu
We provide a general attack framework based on convex optimization and show that by slightly manipulating rewards in the data, an attacker can force the bandit algorithm to pull a target arm for a target contextual vector.
no code implementations • ICML 2018 • Yi-Chen Chen, Lihong Li, Mengdi Wang
In this work, we study a primal-dual formulation of the ALP, and develop a scalable, model-free algorithm called bilinear $\pi$ learning for reinforcement learning when a sampling oracle is provided.
no code implementations • 27 Apr 2018 • Yi-Chen Chen, Lihong Li, Mengdi Wang
In this work, we study a primal-dual formulation of the ALP, and develop a scalable, model-free algorithm called bilinear $\pi$ learning for reinforcement learning when a sampling oracle is provided.
no code implementations • EMNLP 2018 • Da Tang, Xiujun Li, Jianfeng Gao, Chong Wang, Lihong Li, Tony Jebara
Experiments with simulated and real users show that our approach performs competitively against a state-of-the-art method that requires human-defined subgoals.
no code implementations • ICLR 2018 • Ricky Loynd, Matthew Hausknecht, Lihong Li, Li Deng
Humans rely on episodic memory constantly, in remembering the name of someone they met 10 minutes ago, the plot of a movie as it unfolds, or where they parked the car.
no code implementations • ICLR 2018 • Zachary C. Lipton, Kamyar Azizzadenesheli, Abhishek Kumar, Lihong Li, Jianfeng Gao, Li Deng
Many practical reinforcement learning problems contain catastrophic states that the optimal policy visits infrequently or never.
no code implementations • ICML 2018 • Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, Le Song
When function approximation is used, solving the Bellman optimality equation with stability guarantees has remained a major open problem in reinforcement learning for decades.
no code implementations • ICLR 2018 • Bo Dai, Albert Shaw, Niao He, Lihong Li, Le Song
This paper proposes a new actor-critic-style algorithm called Dual Actor-Critic or Dual-AC.
no code implementations • NeurIPS 2017 • Jianshu Chen, Chong Wang, Lin Xiao, Ji He, Lihong Li, Li Deng
In sequential decision making, it is often important and useful for end users to understand the underlying patterns or causes that lead to the corresponding decisions.
no code implementations • 15 Nov 2017 • Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, Li Deng
We present a new algorithm that significantly improves the efficiency of exploration for deep Q-learning agents in dialogue systems.
no code implementations • EMNLP 2017 • Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, Kam-Fai Wong
Building a dialogue agent to fulfill complex tasks, such as travel planning, is challenging because the agent has to learn to collectively complete multiple subtasks.
no code implementations • 21 Mar 2017 • Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, Asli Celikyilmaz
Language understanding is a key component in a spoken dialogue system.
13 code implementations • IJCNLP 2017 • Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, Asli Celikyilmaz
One of the major drawbacks of modularized task-completion dialogue systems is that each module is trained individually, which presents several challenges.
no code implementations • 28 Feb 2017 • Asli Celikyilmaz, Li Deng, Lihong Li, Chong Wang
We introduce a new paradigm of learning for reasoning, understanding, and prediction, as well as the scaffolding network to implement this paradigm.
no code implementations • ICML 2017 • Lihong Li, Yu Lu, Dengyong Zhou
Contextual bandits are widely used in Internet services, from news recommendation to advertising to Web search.
no code implementations • ICML 2017 • Simon S. Du, Jianshu Chen, Lihong Li, Lin Xiao, Dengyong Zhou
Policy evaluation is a crucial step in many reinforcement-learning procedures, which estimates a value function that predicts states' long-term value under a given policy.
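A basic tabular TD(0) sketch of the policy-evaluation problem itself; the variance-reduced, linear-function-approximation methods this work studies are not implemented here, and the data layout is an assumption.

```python
import numpy as np

def td0_evaluate(transitions, num_states, alpha=0.1, gamma=0.99, sweeps=100):
    """transitions: list of (s, r, s_next) tuples generated by following the policy."""
    V = np.zeros(num_states)
    for _ in range(sweeps):
        for s, r, s_next in transitions:
            td_error = r + gamma * V[s_next] - V[s]   # one-step temporal difference
            V[s] += alpha * td_error
    return V
```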
10 code implementations • 17 Dec 2016 • Xiujun Li, Zachary C. Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, Yun-Nung Chen
Then, one can train reinforcement learning agents in an online fashion as they interact with the simulator.
no code implementations • NeurIPS 2016 • Tzu-Kuo Huang, Lihong Li, Ara Vartanian, Saleema Amershi, Jerry Zhu
We present a theoretical analysis of active learning with more realistic interactions with human oracles.
no code implementations • 6 Nov 2016 • Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, Pushmeet Kohli
While achieving impressive results, these approaches have a number of important limitations: (a) they are computationally expensive and hard to train, (b) a model has to be trained for each task (program) separately, and (c) it is hard to interpret or verify the correctness of the learnt mapping (as it is defined by a neural network).
no code implementations • 3 Nov 2016 • Zachary C. Lipton, Kamyar Azizzadenesheli, Abhishek Kumar, Lihong Li, Jianfeng Gao, Li Deng
We introduce intrinsic fear (IF), a learned reward shaping that guards DRL agents against periodic catastrophes.
1 code implementation • ACL 2017 • Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, Li Deng
In this paper, we address this limitation by replacing symbolic queries with an induced "soft" posterior distribution over the KB that indicates which entities the user is interested in.
no code implementations • 17 Aug 2016 • Zachary C. Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, Li Deng
We present a new algorithm that significantly improves the efficiency of exploration for deep Q-learning agents in dialogue systems.
1 code implementation • EMNLP 2016 • Ji He, Mari Ostendorf, Xiaodong He, Jianshu Chen, Jianfeng Gao, Lihong Li, Li Deng
We introduce an online popularity prediction and tracking task as a benchmark task for reinforcement learning with a combinatorial, natural language action space.
3 code implementations • ACL 2016 • Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, Mari Ostendorf
This paper introduces a novel architecture for reinforcement learning with deep neural networks designed to handle state and action spaces characterized by natural language, as found in text-based games.
2 code implementations • 11 Nov 2015 • Nan Jiang, Lihong Li
We study the problem of off-policy value evaluation in reinforcement learning (RL), where one aims to estimate the value of a new policy based on data collected by a different policy.
no code implementations • 10 Sep 2015 • Xiujun Li, Lihong Li, Jianfeng Gao, Xiaodong He, Jianshu Chen, Li Deng, Ji He
Successful applications of reinforcement learning in real-world problems often require dealing with partially observable states.
no code implementations • 10 Jun 2015 • Emma Brunskill, Lihong Li
Transferring knowledge across a sequence of related tasks is an important challenge in reinforcement learning (RL).
no code implementations • 10 Jun 2015 • Shipra Agrawal, Nikhil R. Devanur, Lihong Li
This problem was introduced by Badanidiyuru et al. (2014), who gave a computationally inefficient algorithm with near-optimal regret bounds for it.
no code implementations • 10 Jun 2015 • Che-Yu Liu, Lihong Li
The empirically successful Thompson Sampling algorithm for stochastic bandits has drawn much interest in understanding its theoretical properties.
no code implementations • 28 Apr 2015 • Dragomir Yankov, Pavel Berkhin, Lihong Li
An offline framework is thus necessary to let us decide which policy to apply, and how, in a production environment to ensure a positive outcome.
no code implementations • 10 Mar 2015 • Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li
As such, we expect the doubly robust approach to become common practice in policy evaluation and optimization.
no code implementations • 12 Sep 2014 • Lihong Li, Remi Munos, Csaba Szepesvari
This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy.
no code implementations • 7 Mar 2014 • Lihong Li, Shunbao Chen, Jim Kleban, Ankur Gupta
Optimizing an interactive system against a predefined online metric is particularly challenging when the metric is computed from user feedback such as clicks and payments.
1 code implementation • 4 Feb 2014 • Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, Robert E. Schapire
We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes one of $K$ actions in response to the observed context, and observes the reward only for that chosen action.
no code implementations • 18 Dec 2013 • Zhen Qin, Vaclav Petricek, Nikos Karampatziakis, Lihong Li, John Langford
Bootstrapping is a useful technique for estimating the uncertainty of a predictor, for example, confidence intervals for prediction.
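As a reminder of the quantity involved, here is a standard offline bootstrap confidence interval for a mean; the efficient online bootstrapping this work concerns is not implemented here, and the sample data are illustrative.

```python
import numpy as np

def bootstrap_ci(data, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    means = [rng.choice(data, size=len(data), replace=True).mean()
             for _ in range(num_resamples)]
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

print(bootstrap_ci(np.random.default_rng(1).normal(size=200)))
```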
no code implementations • 27 Oct 2013 • Lihong Li
Similar to most expert-learning algorithms, Generalized Thompson Sampling uses a loss function to adjust the experts' weights.
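A minimal sketch of an exponential-weighting update of the kind described here: each expert's weight is scaled down according to its incurred loss. The learning rate and loss interface are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def update_expert_weights(weights, losses, eta=0.5):
    """weights: current (unnormalized) expert weights; losses: per-expert losses."""
    new_w = np.asarray(weights) * np.exp(-eta * np.asarray(losses))
    return new_w / new_w.sum()   # renormalize to a distribution over experts
```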
no code implementations • 26 Sep 2013 • Emma Brunskill, Lihong Li
Transferring knowledge across a sequence of reinforcement-learning tasks is challenging, and has a number of important applications.
no code implementations • NeurIPS 2011 • Olivier Chapelle, Lihong Li
Thompson sampling is one of the oldest heuristics for addressing the exploration/exploitation trade-off, but it is surprisingly not very popular in the literature.
1 code implementation • 23 Mar 2011 • Miroslav Dudik, John Langford, Lihong Li
The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy.
no code implementations • NeurIPS 2010 • Martin Zinkevich, Markus Weimer, Lihong Li, Alex J. Smola
With the increase in available data, parallel machine learning has become an increasingly pressing problem.
4 code implementations • 31 Mar 2010 • Lihong Li, Wei Chu, John Langford, Xuanhui Wang
Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences but very challenging due to their "partial-label" nature.
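A hedged sketch of a replay-style offline evaluator for such partial-label logs, assuming the logged actions were chosen uniformly at random: only events where the evaluated policy agrees with the logged action contribute. The helper names are illustrative.

```python
import numpy as np

def replay_evaluate(policy, logged_events):
    """logged_events: iterable of (context, logged_action, reward) tuples."""
    matched_rewards = []
    for x, a_logged, r in logged_events:
        if policy(x) == a_logged:            # policy would have shown the same item
            matched_rewards.append(r)
    return float(np.mean(matched_rewards)) if matched_rewards else float("nan")
```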
12 code implementations • 28 Feb 2010 • Lihong Li, Wei Chu, John Langford, Robert E. Schapire
In this work, we model personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks.
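A hedged sketch of an upper-confidence-bound selection rule for a linear contextual bandit in the spirit of this line of work: a ridge-regression reward estimate per arm plus a confidence bonus from the feature covariance. Class and parameter names are illustrative.

```python
import numpy as np

class LinUCBArm:
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)          # ridge-regularized design matrix
        self.b = np.zeros(dim)        # accumulated reward-weighted features
        self.alpha = alpha            # width of the confidence bonus

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def select_arm(arms, x):
    """Pick the arm with the highest upper confidence bound for context x."""
    return int(np.argmax([arm.ucb(x) for arm in arms]))
```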
no code implementations • NeurIPS 2010 • Alex Strehl, John Langford, Sham Kakade, Lihong Li
We provide a sound and consistent foundation for the use of nonrandom exploration data in "contextual bandit" or "partially labeled" settings where only the value of a chosen action is learned.
no code implementations • NeurIPS 2008 • John Langford, Lihong Li, Tong Zhang
We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss.
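A hedged sketch of the truncated-gradient idea: after an ordinary online gradient step on the convex loss, coefficients that are small in magnitude are shrunk toward zero and clipped at zero, which induces sparsity. The shrinkage amount and threshold are illustrative parameters, not the paper's exact schedule.

```python
import numpy as np

def truncate(w, shrink, threshold):
    """Shrink small-magnitude weights toward zero; leave large weights untouched."""
    out = w.copy()
    small = np.abs(w) <= threshold
    out[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - shrink, 0.0)
    return out

def truncated_gradient_step(w, grad, eta=0.1, shrink=0.01, threshold=1.0):
    w = w - eta * grad            # standard online gradient step
    return truncate(w, shrink, threshold)
```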