Search Results for author: Lihong Li

Found 70 papers, 17 papers with code

Explicit Recall for Efficient Exploration

no code implementations ICLR 2019 Honghua Dong, Jiayuan Mao, Xinyue Cui, Lihong Li

In this paper, we advocate the use of explicit memory for efficient exploration in reinforcement learning.

Decision Making, Efficient Exploration

A Map of Bandits for E-commerce

no code implementations 1 Jul 2021 Yi Liu, Lihong Li

The rich body of Bandit literature not only offers a diverse toolbox of algorithms, but also makes it hard for a practitioner to find the right solution to solve the problem at hand.

On the Optimality of Batch Policy Optimization Algorithms

no code implementations 6 Apr 2021 Chenjun Xiao, Yifan Wu, Tor Lattimore, Bo Dai, Jincheng Mei, Lihong Li, Csaba Szepesvari, Dale Schuurmans

First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis.

Value prediction

Near-optimal Representation Learning for Linear Bandits and Linear RL

no code implementations 8 Feb 2021 Jiachen Hu, Xiaoyu Chen, Chi Jin, Lihong Li, LiWei Wang

This paper studies representation learning for multi-task linear bandits and multi-task episodic RL with linear value function approximation.

Representation Learning

Offline Policy Optimization with Variance Regularization

no code implementations 1 Jan 2021 Riashat Islam, Samarth Sinha, Homanga Bharadhwaj, Samin Yeasar Arnob, Zhuoran Yang, Zhaoran Wang, Animesh Garg, Lihong Li, Doina Precup

Learning policies from fixed offline datasets is a key challenge to scale up reinforcement learning (RL) algorithms towards practical applications.

Continuous Control, Offline RL

Escaping the Gravitational Pull of Softmax

no code implementations NeurIPS 2020 Jincheng Mei, Chenjun Xiao, Bo Dai, Lihong Li, Csaba Szepesvari, Dale Schuurmans

Both findings are based on an analysis of convergence rates using the Non-uniform Łojasiewicz (NŁ) inequalities.

CoinDICE: Off-Policy Confidence Interval Estimation

no code implementations NeurIPS 2020 Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvári, Dale Schuurmans

We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies.

Neural Thompson Sampling

1 code implementation ICLR 2021 Weitong Zhang, Dongruo Zhou, Lihong Li, Quanquan Gu

Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems.

Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL

no code implementations ICLR 2021 Xiaoyu Chen, Jiachen Hu, Lihong Li, Li-Wei Wang

The regret of FMDP-BF is shown to be exponentially smaller than that of optimal algorithms designed for non-factored MDPs, and improves on the best previous result for FMDPs (Osband and Van Roy, 2014) by a factor of $\sqrt{H|\mathcal{S}_i|}$, where $|\mathcal{S}_i|$ is the cardinality of the factored state subspace and $H$ is the planning horizon.

Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders

no code implementations 27 Jul 2020 Andrew Bennett, Nathan Kallus, Lihong Li, Ali Mousavi

We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders, where states and actions can act as proxies for the unobserved confounders.

Off-Policy Evaluation via the Regularized Lagrangian

no code implementations NeurIPS 2020 Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, Dale Schuurmans

The recently proposed distribution correction estimation (DICE) family of estimators has advanced the state of the art in off-policy evaluation from behavior-agnostic data.

Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning

no code implementations ICLR 2020 Ali Mousavi, Lihong Li, Qiang Liu, Denny Zhou

Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics, where high-fidelity simulators may not be available and on-policy evaluation is expensive or impossible.

Batch Stationary Distribution Estimation

1 code implementation ICML 2020 Junfeng Wen, Bo Dai, Lihong Li, Dale Schuurmans

We consider the problem of approximating the stationary distribution of an ergodic Markov chain given a set of sampled transitions.

GenDICE: Generalized Offline Estimation of Stationary Values

2 code implementations ICLR 2020 Ruiyi Zhang, Bo Dai, Lihong Li, Dale Schuurmans

An important problem that arises in reinforcement learning and Monte Carlo methods is estimating quantities defined by the stationary distribution of a Markov chain.

AlgaeDICE: Policy Gradient from Arbitrary Experience

no code implementations 4 Dec 2019 Ofir Nachum, Bo Dai, Ilya Kostrikov, Yin-Lam Chow, Lihong Li, Dale Schuurmans

In many real-world applications of reinforcement learning (RL), interactions with the environment are limited due to cost or feasibility.

Neural Contextual Bandits with UCB-based Exploration

1 code implementation ICML 2020 Dongruo Zhou, Lihong Li, Quanquan Gu

To the best of our knowledge, it is the first neural network-based contextual bandit algorithm with a near-optimal regret guarantee.

Efficient Exploration, Multi-Armed Bandits

Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation

no code implementations ICLR 2020 Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, Qiang Liu

Our method is doubly robust in that the bias vanishes when either the density ratio or the value function estimation is perfect.

Density Ratio Estimation

Randomized Exploration in Generalized Linear Bandits

no code implementations 21 Jun 2019 Branislav Kveton, Manzil Zaheer, Csaba Szepesvari, Lihong Li, Mohammad Ghavamzadeh, Craig Boutilier

GLM-TSL samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution.

DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

2 code implementations NeurIPS 2019 Ofir Nachum, Yin-Lam Chow, Bo Dai, Lihong Li

In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset.

A Kernel Loss for Solving the Bellman Equation

1 code implementation NeurIPS 2019 Yihao Feng, Lihong Li, Qiang Liu

Value function learning plays a central role in many state-of-the-art reinforcement-learning algorithms.

Q-Learning

Neural Logic Machines

1 code implementation ICLR 2019 Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, Denny Zhou

We propose the Neural Logic Machine (NLM), a neural-symbolic architecture for both inductive learning and logic reasoning.

Decision Making, Inductive logic programming

Policy Certificates: Towards Accountable Reinforcement Learning

no code implementations 7 Nov 2018 Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill

The performance of a reinforcement learning algorithm can vary drastically during learning because of exploration.

Adversarial Attacks on Stochastic Bandits

no code implementations NeurIPS 2018 Kwang-Sung Jun, Lihong Li, Yuzhe Ma, Xiaojin Zhu

We study adversarial attacks that manipulate the reward signals to control the actions chosen by a stochastic multi-armed bandit algorithm.

Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

2 code implementations NeurIPS 2018 Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou

We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy.

Neural Approaches to Conversational AI

no code implementations ACL 2018 Jianfeng Gao, Michel Galley, Lihong Li

The present paper surveys neural approaches to conversational AI that have been developed in the last few years.

Question Answering

Data Poisoning Attacks in Contextual Bandits

no code implementations 17 Aug 2018 Yuzhe Ma, Kwang-Sung Jun, Lihong Li, Xiaojin Zhu

We provide a general attack framework based on convex optimization and show that by slightly manipulating rewards in the data, an attacker can force the bandit algorithm to pull a target arm for a target contextual vector.

Data Poisoning, Multi-Armed Bandits

Scalable Bilinear Pi Learning Using State and Action Features

no code implementations ICML 2018 Yi-Chen Chen, Lihong Li, Mengdi Wang

In this work, we study a primal-dual formulation of the ALP, and develop a scalable, model-free algorithm called bilinear $\pi$ learning for reinforcement learning when a sampling oracle is provided.

Scalable Bilinear $π$ Learning Using State and Action Features

no code implementations 27 Apr 2018 Yi-Chen Chen, Lihong Li, Mengdi Wang

In this work, we study a primal-dual formulation of the ALP, and develop a scalable, model-free algorithm called bilinear $\pi$ learning for reinforcement learning when a sampling oracle is provided.

Subgoal Discovery for Hierarchical Dialogue Policy Learning

no code implementations EMNLP 2018 Da Tang, Xiujun Li, Jianfeng Gao, Chong Wang, Lihong Li, Tony Jebara

Experiments with simulated and real users show that our approach performs competitively against a state-of-the-art method that requires human-defined subgoals.

Hierarchical Reinforcement Learning

Now I Remember! Episodic Memory For Reinforcement Learning

no code implementations ICLR 2018 Ricky Loynd, Matthew Hausknecht, Lihong Li, Li Deng

Humans rely on episodic memory constantly, in remembering the name of someone they met 10 minutes ago, the plot of a movie as it unfolds, or where they parked the car.

Boosting the Actor with Dual Critic

no code implementations ICLR 2018 Bo Dai, Albert Shaw, Niao He, Lihong Li, Le Song

This paper proposes a new actor-critic-style algorithm called Dual Actor-Critic or Dual-AC.

SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation

no code implementations ICML 2018 Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, Le Song

When function approximation is used, solving the Bellman optimality equation with stability guarantees has remained a major open problem in reinforcement learning for decades.

Q-Learning

Q-LDA: Uncovering Latent Patterns in Text-based Sequential Decision Processes

no code implementations NeurIPS 2017 Jianshu Chen, Chong Wang, Lin Xiao, Ji He, Lihong Li, Li Deng

In sequential decision making, it is often important and useful for end users to understand the underlying patterns or causes that lead to the corresponding decisions.

Decision Making, Q-Learning

BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems

no code implementations 15 Nov 2017 Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, Li Deng

We present a new algorithm that significantly improves the efficiency of exploration for deep Q-learning agents in dialogue systems.

Efficient Exploration, Q-Learning

Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning

no code implementations EMNLP 2017 Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, Kam-Fai Wong

Building a dialogue agent to fulfill complex tasks, such as travel planning, is challenging because the agent has to learn to collectively complete multiple subtasks.

Task-Completion Dialogue Policy Learning

End-to-End Task-Completion Neural Dialogue Systems

13 code implementations IJCNLP 2017 Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, Asli Celikyilmaz

One of the major drawbacks of modularized task-completion dialogue systems is that each module is trained individually, which presents several challenges.

Chatbot

Provably Optimal Algorithms for Generalized Linear Contextual Bandits

no code implementations ICML 2017 Lihong Li, Yu Lu, Dengyong Zhou

Contextual bandits are widely used in Internet services from news recommendation to advertising, and to Web search.

Multi-Armed Bandits, News Recommendation

Scaffolding Networks: Incremental Learning and Teaching Through Questioning

no code implementations 28 Feb 2017 Asli Celikyilmaz, Li Deng, Lihong Li, Chong Wang

We introduce a new paradigm of learning for reasoning, understanding, and prediction, as well as the scaffolding network to implement this paradigm.

Incremental Learning

Stochastic Variance Reduction Methods for Policy Evaluation

no code implementations ICML 2017 Simon S. Du, Jianshu Chen, Lihong Li, Lin Xiao, Dengyong Zhou

Policy evaluation is a crucial step in many reinforcement-learning procedures, which estimates a value function that predicts states' long-term value under a given policy.

A User Simulator for Task-Completion Dialogues

10 code implementations 17 Dec 2016 Xiujun Li, Zachary C. Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, Yun-Nung Chen

Then, one can train reinforcement learning agents in an online fashion as they interact with the simulator.

Task-Oriented Dialogue Systems

Active Learning with Oracle Epiphany

no code implementations NeurIPS 2016 Tzu-Kuo Huang, Lihong Li, Ara Vartanian, Saleema Amershi, Jerry Zhu

We present a theoretical analysis of active learning with more realistic interactions with human oracles.

Active Learning

Neuro-Symbolic Program Synthesis

no code implementations 6 Nov 2016 Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, Pushmeet Kohli

While achieving impressive results, these approaches have a number of important limitations: (a) they are computationally expensive and hard to train, (b) a model has to be trained for each task (program) separately, and (c) it is hard to interpret or verify the correctness of the learnt mapping (as it is defined by a neural network).

Program induction, Program Synthesis

Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access

1 code implementation ACL 2017 Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, Li Deng

In this paper, we address this limitation by replacing symbolic queries with an induced "soft" posterior distribution over the KB that indicates which entities the user is interested in.

Deep Reinforcement Learning with a Combinatorial Action Space for Predicting Popular Reddit Threads

1 code implementation EMNLP 2016 Ji He, Mari Ostendorf, Xiaodong He, Jianshu Chen, Jianfeng Gao, Lihong Li, Li Deng

We introduce an online popularity prediction and tracking task as a benchmark task for reinforcement learning with a combinatorial, natural language action space.

Deep Reinforcement Learning with a Natural Language Action Space

3 code implementations ACL 2016 Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, Mari Ostendorf

This paper introduces a novel architecture for reinforcement learning with deep neural networks designed to handle state and action spaces characterized by natural language, as found in text-based games.

Q-Learning, text-based games

Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

no code implementations 11 Nov 2015 Nan Jiang, Lihong Li

We study the problem of off-policy value evaluation in reinforcement learning (RL), where one aims to estimate the value of a new policy based on data collected by a different policy.

Decision Making

Recurrent Reinforcement Learning: A Hybrid Approach

no code implementations 10 Sep 2015 Xiujun Li, Lihong Li, Jianfeng Gao, Xiaodong He, Jianshu Chen, Li Deng, Ji He

Successful applications of reinforcement learning in real-world problems often require dealing with partially observable states.

An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives

no code implementations 10 Jun 2015 Shipra Agrawal, Nikhil R. Devanur, Lihong Li

This problem was introduced by Badanidiyuru et al. (2014), who gave a computationally inefficient algorithm with near-optimal regret bounds for it.

Multi-Armed Bandits

The Online Coupon-Collector Problem and Its Application to Lifelong Reinforcement Learning

no code implementations 10 Jun 2015 Emma Brunskill, Lihong Li

Transferring knowledge across a sequence of related tasks is an important challenge in reinforcement learning (RL).

Human robot interaction

On the Prior Sensitivity of Thompson Sampling

no code implementations 10 Jun 2015 Che-Yu Liu, Lihong Li

The empirically successful Thompson Sampling algorithm for stochastic bandits has drawn much interest in understanding its theoretical properties.

Evaluation of Explore-Exploit Policies in Multi-result Ranking Systems

no code implementations 28 Apr 2015 Dragomir Yankov, Pavel Berkhin, Lihong Li

An offline framework is thus necessary to let us decide which policy to apply in a production environment, and how to apply it, to ensure a positive outcome.

News Recommendation

Doubly Robust Policy Evaluation and Optimization

no code implementations 10 Mar 2015 Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li

As such, we expect the doubly robust approach to become common practice in policy evaluation and optimization.

Decision Making, Multi-Armed Bandits

On Minimax Optimal Offline Policy Evaluation

no code implementations 12 Sep 2014 Lihong Li, Remi Munos, Csaba Szepesvari

This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy.

Multi-Armed Bandits
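The off-policy evaluation setting described in this entry can be illustrated with a minimal sketch of inverse propensity scoring (importance weighting) for a bandit. This is a standard textbook estimator chosen for illustration, not necessarily the estimator analyzed in the paper; the function names and the logged data below are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def ips_estimate(
    logged: Sequence[Tuple[int, float]],
    behavior_prob: Callable[[int], float],
    target_prob: Callable[[int], float],
) -> float:
    """Average the importance-weighted rewards over logged (action, reward) pairs."""
    total = 0.0
    for action, reward in logged:
        # Reweight each sample by how much more (or less) likely the
        # target policy is to choose this action than the behavior policy.
        weight = target_prob(action) / behavior_prob(action)
        total += weight * reward
    return total / len(logged)


def behavior(action: int) -> float:
    # Hypothetical behavior policy: two actions, chosen uniformly.
    return 0.5


def target(action: int) -> float:
    # Hypothetical target policy: strongly prefers action 0.
    return 0.9 if action == 0 else 0.1


# Hypothetical log of (action, reward) pairs collected by the behavior policy.
logged_data = [(0, 1.0), (1, 0.0), (0, 1.0), (1, 1.0)]
estimate = ips_estimate(logged_data, behavior, target)
```

When the target policy equals the behavior policy, every weight is 1 and the estimate reduces to the plain average of the logged rewards.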

Counterfactual Estimation and Optimization of Click Metrics for Search Engines

no code implementations 7 Mar 2014 Lihong Li, Shunbao Chen, Jim Kleban, Ankur Gupta

Optimizing an interactive system against a predefined online metric is particularly challenging, when the metric is computed from user feedback such as clicks and payments.

Causal Inference

Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits

1 code implementation 4 Feb 2014 Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, Robert E. Schapire

We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes one of $K$ actions in response to the observed context, and observes the reward only for that chosen action.

General Classification, Multi-Armed Bandits
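The interaction protocol in this entry (observe a context, take one of $K$ actions, see the reward for that action only) can be sketched as a simple loop. The learner below is a plain epsilon-greedy rule over discrete contexts, swapped in purely for illustration; it is not the algorithm proposed in the paper, and all names and the reward model are assumptions.

```python
import random


def run_contextual_bandit(num_rounds, num_actions, num_contexts,
                          reward_fn, epsilon=0.1, seed=0):
    """Simulate the contextual bandit protocol with an epsilon-greedy learner."""
    rng = random.Random(seed)
    counts = [[0] * num_actions for _ in range(num_contexts)]
    means = [[0.0] * num_actions for _ in range(num_contexts)]
    history = []
    for _ in range(num_rounds):
        # 1. The environment reveals a context.
        context = rng.randrange(num_contexts)
        # 2. The learner picks one action (explore with probability epsilon).
        if rng.random() < epsilon:
            action = rng.randrange(num_actions)
        else:
            action = max(range(num_actions), key=lambda a: means[context][a])
        # 3. Only the chosen action's reward is observed.
        reward = reward_fn(context, action, rng)
        counts[context][action] += 1
        means[context][action] += (
            (reward - means[context][action]) / counts[context][action]
        )
        history.append((context, action, reward))
    return history, means


def matching_reward(context, action, rng):
    # Hypothetical reward model: the action matching the context pays off more often.
    success_prob = 0.9 if action == context else 0.1
    return 1.0 if rng.random() < success_prob else 0.0


history, means = run_contextual_bandit(500, 2, 2, matching_reward)
```

Each history entry holds a single reward, which reflects the partial feedback that distinguishes this setting from supervised classification.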

Efficient Online Bootstrapping for Large Scale Learning

no code implementations 18 Dec 2013 Zhen Qin, Vaclav Petricek, Nikos Karampatziakis, Lihong Li, John Langford

Bootstrapping is a useful technique for estimating the uncertainty of a predictor, for example, confidence intervals for prediction.

Generalized Thompson Sampling for Contextual Bandits

no code implementations 27 Oct 2013 Lihong Li

Similar to most expert-learning algorithms, Generalized Thompson Sampling uses a loss function to adjust the experts' weights.

Multi-Armed Bandits

Sample Complexity of Multi-task Reinforcement Learning

no code implementations 26 Sep 2013 Emma Brunskill, Lihong Li

Transferring knowledge across a sequence of reinforcement-learning tasks is challenging, and has a number of important applications.

An Empirical Evaluation of Thompson Sampling

no code implementations NeurIPS 2011 Olivier Chapelle, Lihong Li

Thompson sampling is one of the oldest heuristics for addressing the exploration/exploitation trade-off, but it is surprisingly not very popular in the literature.
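For readers unfamiliar with the heuristic, here is a minimal sketch of Thompson sampling for Bernoulli arms with Beta(1, 1) priors. This is the common textbook form, assumed here for illustration; the function name, arm probabilities, and round count are hypothetical and not taken from the paper.

```python
import random


def thompson_sampling(true_probs, num_rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling and return per-arm pull counts."""
    rng = random.Random(seed)
    num_arms = len(true_probs)
    successes = [1] * num_arms  # Beta(1, 1) prior: one pseudo-success
    failures = [1] * num_arms   # and one pseudo-failure per arm
    pulls = [0] * num_arms
    for _ in range(num_rounds):
        # Draw one plausible success rate per arm from its posterior.
        samples = [rng.betavariate(s, f) for s, f in zip(successes, failures)]
        # Play the arm whose sampled rate is highest.
        arm = max(range(num_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Update the posterior of the played arm only.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling([0.2, 0.8], num_rounds=2000)
```

Exploration comes from the randomness of the posterior draws: an arm with few observations has a wide posterior and is still sampled occasionally, while well-explored poor arms are played less and less.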

Doubly Robust Policy Evaluation and Learning

1 code implementation 23 Mar 2011 Miroslav Dudik, John Langford, Lihong Li

The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy.

Decision Making, Multi-Armed Bandits

Parallelized Stochastic Gradient Descent

no code implementations NeurIPS 2010 Martin Zinkevich, Markus Weimer, Lihong Li, Alex J. Smola

With the increase in available data, parallel machine learning has become an increasingly pressing problem.

Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms

3 code implementations 31 Mar 2010 Lihong Li, Wei Chu, John Langford, Xuanhui Wang

Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences but very challenging due to their "partial-label" nature.

News Recommendation, Recommendation Systems

A Contextual-Bandit Approach to Personalized News Article Recommendation

8 code implementations 28 Feb 2010 Lihong Li, Wei Chu, John Langford, Robert E. Schapire

In this work, we model personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks.

Learning Theory

Learning from Logged Implicit Exploration Data

no code implementations NeurIPS 2010 Alex Strehl, John Langford, Sham Kakade, Lihong Li

We provide a sound and consistent foundation for the use of nonrandom exploration data in "contextual bandit" or "partially labeled" settings where only the value of a chosen action is learned.

Sparse Online Learning via Truncated Gradient

no code implementations NeurIPS 2008 John Langford, Lihong Li, Tong Zhang

We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss.
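The idea of inducing sparsity during online updates can be sketched as follows: take an ordinary gradient step, then shrink small weights toward zero by a fixed amount while leaving large weights untouched. The operator shape and the parameter names (`gravity`, `theta`) are assumptions made for illustration; the published method may differ in details such as how often truncation is applied.

```python
def truncate(value, alpha, theta):
    """Shrink value toward zero by alpha, but only when |value| <= theta."""
    if 0.0 <= value <= theta:
        return max(0.0, value - alpha)
    if -theta <= value < 0.0:
        return min(0.0, value + alpha)
    return value  # large weights are left unchanged


def truncated_gradient_step(weights, gradient, learning_rate, gravity, theta):
    """One online update: gradient step, then truncation of each weight."""
    updated = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return [truncate(w, learning_rate * gravity, theta) for w in updated]


# Hypothetical weights: one small, one large; zero gradient isolates the truncation.
new_weights = truncated_gradient_step(
    weights=[0.05, 2.0], gradient=[0.0, 0.0],
    learning_rate=0.1, gravity=1.0, theta=1.0,
)
```

In this example the small weight is driven exactly to zero while the large one passes through unchanged, which is the sparsity-inducing behavior the abstract refers to.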
