no code implementations • ICML 2020 • Tong Yu, Branislav Kveton, Zheng Wen, Ruiyi Zhang, Ole J. Mengshoel
We experiment with three structured bandit problems: cascading bandits, online learning to rank in the position-based model, and rank-1 bandits.
no code implementations • ICML 2020 • Pierre Perrault, Zheng Wen, Michal Valko, Jennifer Healey
We introduce a new budgeted framework for online influence maximization, considering the total cost of an advertising campaign instead of the common cardinality constraint on a chosen influencer set.
no code implementations • 31 Jan 2025 • Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen
We address the problem of best policy identification in preference-based reinforcement learning (PbRL), where learning occurs from noisy binary preferences over trajectory pairs rather than explicit numerical rewards.
no code implementations • 11 Nov 2024 • Xinqi Yang, Scott Zang, Yong Ren, Dingjie Peng, Zheng Wen
Our dataset is available on Hugging Face.
no code implementations • 13 Jun 2024 • Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen
We propose $\mathsf{warmPref-PS}$, a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback.
no code implementations • 2 Dec 2023 • Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy
Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA).
no code implementations • 17 Oct 2023 • Dengwang Tang, Rahul Jain, Botao Hao, Zheng Wen
In this paper, we study the problem of efficient online reinforcement learning in the infinite horizon setting when there is an offline dataset to start with.
no code implementations • 18 Jul 2023 • Xingyue Ma, Hongying Chen, Ri He, Zhanbo Yu, Sergei Prokhorenko, Zheng Wen, Zhicheng Zhong, Jorge Iñiguez, L. Bellaiche, Di Wu, Yurong Yang
The first-principles-based effective Hamiltonian scheme provides one of the most accurate modeling techniques for large-scale structures, especially for ferroelectrics.
no code implementations • 20 Mar 2023 • Botao Hao, Rahul Jain, Dengwang Tang, Zheng Wen
We first propose an Informed Posterior Sampling-based RL (iPSRL) algorithm that uses the offline dataset, and information about the expert's behavioral policy used to generate the offline dataset.
1 code implementation • 18 Feb 2023 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, Benjamin Van Roy
Further, we demonstrate that the epinet -- a small additive network that estimates uncertainty -- matches the performance of large ensembles at orders of magnitude lower computational cost.
no code implementations • 7 Feb 2023 • Botao Hao, Rahul Jain, Tor Lattimore, Benjamin Van Roy, Zheng Wen
This offers insight into how pretraining can greatly improve online performance and how the degree of improvement increases with the expert's competence level.
no code implementations • 15 Dec 2022 • Seksan Kiatsupaibul, Pakawan Chansiripas, Pojtanut Manopanjasiri, Kantapong Visantavarakul, Zheng Wen
This paper proposes a novel reinforcement learning (RL) framework for credit underwriting that tackles the challenges posed by ungeneralizable contexts.
no code implementations • 1 Jul 2022 • Xiuyuan Lu, Ian Osband, Seyed Mohammad Asghari, Sven Gowal, Vikranth Dwaracherla, Zheng Wen, Benjamin Van Roy
However, these improvements are relatively small compared to the outstanding issues in distributionally-robust deep learning.
no code implementations • 8 Jun 2022 • Vikranth Dwaracherla, Zheng Wen, Ian Osband, Xiuyuan Lu, Seyed Mohammad Asghari, Benjamin Van Roy
In machine learning, an agent needs to estimate uncertainty in order to explore efficiently, adapt, and make effective decisions.
no code implementations • 2 Mar 2022 • Chao Qin, Zheng Wen, Xiuyuan Lu, Benjamin Van Roy
Ensemble sampling serves as a practical approximation to Thompson sampling when maintaining an exact posterior distribution over model parameters is computationally intractable.
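A minimal sketch of the idea on a toy Gaussian bandit (an illustrative assumption, not the setting analyzed in the paper): each ensemble member maintains its own estimate trained on independently perturbed observations, and acting greedily with respect to a uniformly sampled member stands in for drawing a posterior sample.

```python
import numpy as np

# Ensemble sampling sketch on a K-armed Gaussian bandit (toy illustration).
rng = np.random.default_rng(0)
K, M, T = 5, 10, 2000                      # arms, ensemble size, horizon
true_means = rng.normal(0, 1, K)

estimates = rng.normal(0, 1, (M, K))       # each member starts from an independent prior draw
counts = np.zeros((M, K))
pulls = np.zeros(K, dtype=int)

for t in range(T):
    m = rng.integers(M)                    # sample an ensemble member (stand-in for a posterior sample)
    arm = int(np.argmax(estimates[m]))     # act greedily with respect to that member
    reward = rng.normal(true_means[arm], 1)
    pulls[arm] += 1

    # every member updates on an independently perturbed copy of the observation,
    # so the spread across members serves as a proxy for posterior uncertainty
    for i in range(M):
        perturbed = reward + rng.normal(0, 1)
        counts[i, arm] += 1
        estimates[i, arm] += (perturbed - estimates[i, arm]) / (counts[i, arm] + 1.0)

print("true best arm:", int(np.argmax(true_means)), "most pulled arm:", int(np.argmax(pulls)))
```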
1 code implementation • 28 Feb 2022 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Xiuyuan Lu, Benjamin Van Roy
Previous work has developed methods for assessing low-order predictive distributions with inputs sampled i.i.d.
2 code implementations • 9 Oct 2021 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Botao Hao, Morteza Ibrahimi, Dieterich Lawson, Xiuyuan Lu, Brendan O'Donoghue, Benjamin Van Roy
Predictive distributions quantify uncertainties ignored by point estimates.
no code implementations • 29 Sep 2021 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Xiuyuan Lu, Morteza Ibrahimi, Vikranth Dwaracherla, Dieterich Lawson, Brendan O'Donoghue, Botao Hao, Benjamin Van Roy
This paper introduces The Neural Testbed, which provides tools for the systematic evaluation of agents that generate such predictions.
no code implementations • 20 Jul 2021 • Zheng Wen, Ian Osband, Chao Qin, Xiuyuan Lu, Morteza Ibrahimi, Vikranth Dwaracherla, Mohammad Asghari, Benjamin Van Roy
A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes?
1 code implementation • NeurIPS 2023 • Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, Benjamin Van Roy
We introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty.
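A minimal sketch of the additive idea only, under simplified assumptions; this is not the published epinet architecture or the `enn` library API. A small index-conditioned network is added on top of a base network, with gradients stopped into the base features through that branch, so that varying the index produces an ensemble-like spread of predictions.

```python
import torch
import torch.nn as nn

class TinyEpinetModel(nn.Module):
    """Illustrative base network plus a small additive, index-conditioned network."""

    def __init__(self, d_in, d_hidden=64, d_index=8, n_out=1):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.head = nn.Linear(d_hidden, n_out)
        # small additive network: takes detached base features and an epistemic index z
        self.epinet = nn.Sequential(
            nn.Linear(d_hidden + d_index, 32), nn.ReLU(), nn.Linear(32, n_out)
        )
        self.d_index = d_index

    def forward(self, x, z=None):
        if z is None:                                  # sample a fresh epistemic index
            z = torch.randn(x.shape[0], self.d_index)
        feats = self.base(x)
        mean = self.head(feats)
        # stop gradients into the base through the additive branch
        return mean + self.epinet(torch.cat([feats.detach(), z], dim=-1))

model = TinyEpinetModel(d_in=10)
x = torch.randn(4, 10)
samples = torch.stack([model(x) for _ in range(20)])   # vary z to get a predictive spread
print(samples.mean(0).shape, samples.std(0).shape)      # predictive mean and uncertainty proxy
```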
no code implementations • 20 Apr 2021 • Alfonso Lobos, Paul Grigas, Zheng Wen
We consider an online revenue maximization problem over a finite time horizon subject to lower and upper bounds on cost.
no code implementations • 6 Mar 2021 • Xiuyuan Lu, Benjamin Van Roy, Vikranth Dwaracherla, Morteza Ibrahimi, Ian Osband, Zheng Wen
To illustrate concepts, we design simple agents that build on them and present computational results that highlight data efficiency.
no code implementations • 5 Jan 2021 • Pierre Perrault, Jennifer Healey, Zheng Wen, Michal Valko
We demonstrate that from an algorithm with an approximation-factor guarantee for the ratio of submodular (RS) optimization problem, one can build an algorithm with a different, weaker kind of approximation guarantee for the difference of submodular (DS) optimization problem, and vice versa.
Data Structures and Algorithms
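For reference, the two problems above are commonly stated as follows, for submodular $f, g : 2^V \to \mathbb{R}_{\ge 0}$ over a ground set $V$ (the paper's exact normalization and constraints may differ):

$$\text{(RS)}\quad \min_{X \subseteq V} \ \frac{f(X)}{g(X)}, \qquad\qquad \text{(DS)}\quad \min_{X \subseteq V} \ \big(f(X) - g(X)\big).$$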
no code implementations • NeurIPS 2021 • Pan Xu, Zheng Wen, Handong Zhao, Quanquan Gu
We study a general class of contextual bandits, where each context-action pair is associated with a raw feature vector, but the reward generating function is unknown.
no code implementations • NeurIPS 2020 • Zheng Wen, Doina Precup, Morteza Ibrahimi, Andre Barreto, Benjamin Van Roy, Satinder Singh
Hierarchical Reinforcement Learning (HRL) approaches promise to provide more efficient solutions to sequential decision making problems, in terms of both statistical and computational efficiency.
no code implementations • 5 Oct 2020 • Jing Shi, Ning Xu, Trung Bui, Franck Dernoncourt, Zheng Wen, Chenliang Xu
To solve this new task, we first present a new language-driven image editing dataset that supports both local and global editing with editing operation and mask annotations.
no code implementations • 17 Aug 2020 • Wenlong Mou, Zheng Wen, Xi Chen
To avoid such undesirable dependence on the state and action space sizes, this paper proposes a new notion of eluder dimension for the policy space, which characterizes the intrinsic complexity of policy learning in an arbitrary Markov Decision Process (MDP).
1 code implementation • 31 Jul 2020 • Jie Zhou, Botao Hao, Zheng Wen, Jingfei Zhang, Will Wei Sun
We consider two settings, tensor bandits without context and tensor bandits with context.
no code implementations • ICML 2020 • Youngsuk Park, Ryan A. Rossi, Zheng Wen, Gang Wu, Handong Zhao
In this paper, we introduce Structured Policy Iteration (S-PI) for LQR, a method capable of deriving a structured linear policy.
no code implementations • 9 Jul 2020 • Tong Yu, Branislav Kveton, Zheng Wen, Ruiyi Zhang, Ole J. Mengshoel
We propose a novel framework for structured bandits, which we call an influence diagram bandit.
no code implementations • ICLR 2020 • Vikranth Dwaracherla, Xiuyuan Lu, Morteza Ibrahimi, Ian Osband, Zheng Wen, Benjamin Van Roy
This generalizes and extends the use of ensembles to approximate Thompson sampling.
no code implementations • ACL 2020 • Ruiyi Zhang, Changyou Chen, Zhe Gan, Wenlin Wang, Dinghan Shen, Guoyin Wang, Zheng Wen, Lawrence Carin
Auto-regressive text generation models usually focus on local fluency and may produce semantically inconsistent long-form text.
no code implementations • 20 Jan 2020 • Ruiyi Zhang, Changyou Chen, Zhe Gan, Zheng Wen, Wenlin Wang, Lawrence Carin
Reinforcement learning (RL) has been widely studied for improving sequence-generation models.
no code implementations • NeurIPS 2019 • Botao Hao, Yasin Abbasi-Yadkori, Zheng Wen, Guang Cheng
The Upper Confidence Bound (UCB) method is arguably the most celebrated approach to online decision making with partial information feedback.
no code implementations • 20 Apr 2019 • Branislav Kveton, Saied Mahdian, S. Muthukrishnan, Zheng Wen, Yikun Xian
We design an online learning algorithm for solving this problem, which interleaves learning and optimization, and prove that this algorithm has sublinear regret.
no code implementations • 4 Mar 2019 • Shuai Li, Wei Chen, Zheng Wen, Kwong-Sak Leung
We consider a problem of stochastic online learning with general probabilistic graph feedback, where each directed edge in the feedback graph has probability $p_{ij}$.
no code implementations • 19 Feb 2019 • Ruiyi Zhang, Zheng Wen, Changyou Chen, Lawrence Carin
Thompson sampling (TS) is a class of algorithms for sequential decision-making, which requires maintaining a posterior distribution over a model.
no code implementations • NeurIPS 2018 • Georgios Theocharous, Zheng Wen, Yasin Abbasi, Nikos Vlassis
Our algorithm, termed deterministic schedule PSRL (DS-PSRL), is efficient in terms of time, sample, and space complexity.
no code implementations • 13 Nov 2018 • Branislav Kveton, Csaba Szepesvari, Sharan Vaswani, Zheng Wen, Mohammad Ghavamzadeh, Tor Lattimore
Specifically, it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history with pseudo rewards.
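A minimal sketch of the described mechanism for Bernoulli rewards (the pseudo-reward schedule, constants, and guarantees in the paper may differ): each round, every arm's history, padded with pseudo rewards, is bootstrap-resampled, and the arm with the highest bootstrap mean is pulled.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 3, 3000
true_probs = np.array([0.3, 0.5, 0.7])
histories = [[0.0, 1.0] for _ in range(K)]   # seed pseudo rewards so every arm can be resampled
pulls = np.zeros(K, dtype=int)

for t in range(T):
    # non-parametric bootstrap of each arm's history
    boot_means = [rng.choice(h, size=len(h), replace=True).mean() for h in histories]
    arm = int(np.argmax(boot_means))          # pull the arm with the highest bootstrap mean
    reward = float(rng.random() < true_probs[arm])
    # record the observation plus one optimistic and one pessimistic pseudo reward
    histories[arm].extend([reward, 1.0, 0.0])
    pulls[arm] += 1

print("pulls per arm:", pulls)   # the best arm (index 2) should dominate
```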
no code implementations • 1 Nov 2018 • Prakhar Gupta, Gaurush Hiranandani, Harvineet Singh, Branislav Kveton, Zheng Wen, Iftikhar Ahamath Burhanuddin
We assume that the user examines the list of recommended items until the user is attracted by an item, which is clicked, and does not examine the rest of the items.
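A tiny simulation of this examination assumption, with made-up attraction probabilities: the user scans the ranked list top-down, clicks the first attractive item, and stops examining.

```python
import numpy as np

def cascade_click(ranked_attraction_probs, rng):
    """Return the position of the click under the cascade assumption, or None."""
    for position, p in enumerate(ranked_attraction_probs):
        if rng.random() < p:      # user is attracted -> click and stop examining
            return position
    return None                   # no click; the whole list was examined

rng = np.random.default_rng(0)
print(cascade_click([0.1, 0.4, 0.05, 0.3], rng))
```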
no code implementations • 3 Jun 2018 • Sumeet Katariya, Branislav Kveton, Zheng Wen, Vamsi K. Potluru
In many practical problems, a learning agent may want to learn the best action in hindsight without ever taking a bad action, which is significantly worse than the default production action.
no code implementations • 24 May 2018 • Sharan Vaswani, Branislav Kveton, Zheng Wen, Anup Rao, Mark Schmidt, Yasin Abbasi-Yadkori
We investigate the use of bootstrapping in the bandit setting.
no code implementations • 27 Apr 2018 • Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, Zheng Wen
We analyze our estimators and prove that they are more efficient than the estimators that do not use the structure of the click model, under the assumption that the click model holds.
no code implementations • 11 Feb 2018 • Yang Cao, Zheng Wen, Branislav Kveton, Yao Xie
Multi-armed bandit (MAB) is a class of online learning problems where a learning agent aims to maximize its expected cumulative reward while repeatedly pulling arms with unknown reward distributions.
no code implementations • 13 Dec 2017 • Branislav Kveton, Csaba Szepesvari, Anup Rao, Zheng Wen, Yasin Abbasi-Yadkori, S. Muthukrishnan
Many problems in computer vision and recommender systems involve low-rank matrices.
no code implementations • 21 Nov 2017 • Georgios Theocharous, Zheng Wen, Yasin Abbasi-Yadkori, Nikos Vlassis
Our algorithm, termed deterministic schedule PSRL (DS-PSRL), is efficient in terms of time, sample, and space complexity.
no code implementations • 21 Sep 2017 • Tong Yu, Branislav Kveton, Zheng Wen, Hung Bui, Ole J. Mengshoel
We study the problem of learning a latent variable model from a stream of data.
2 code implementations • 7 Jul 2017 • Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen
Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance.
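A minimal sketch of this balance on a Bernoulli bandit with conjugate Beta priors (a standard textbook instance, not specific to the tutorial's examples): sampling from the posterior and acting greedily on the sample explores exactly where uncertainty remains.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.45, 0.55, 0.60])
alpha = np.ones_like(true_probs)    # Beta(1, 1) priors
beta = np.ones_like(true_probs)

for t in range(5000):
    theta = rng.beta(alpha, beta)            # one posterior sample per arm
    arm = int(np.argmax(theta))              # act greedily with respect to the sample
    reward = rng.random() < true_probs[arm]
    alpha[arm] += reward                     # conjugate posterior update
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```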
no code implementations • 6 Jun 2017 • Paul Grigas, Alfonso Lobos, Zheng Wen, Kuang-Chih Lee
We develop an optimization model and corresponding algorithm for the management of a demand-side platform (DSP), whereby the DSP aims to maximize its own profit while acquiring valuable impressions for its advertiser clients.
Optimization and Control • Computer Science and Game Theory
no code implementations • 22 Mar 2017 • Ian Osband, Benjamin Van Roy, Daniel Russo, Zheng Wen
We study the use of randomized value functions to guide deep exploration in reinforcement learning.
no code implementations • 19 Mar 2017 • Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, Zheng Wen
The probability that a user will click a search result depends both on its relevance and its position on the results page.
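In the position-based model this is commonly written as a factorization of the click probability into an item-dependent attraction term and a position-dependent examination term (notation assumed here for illustration; the paper's may differ):

$$\Pr(\text{click on item } i \text{ at position } k) \;=\; \alpha(i)\,\chi(k),$$

where $\alpha(i) \in [0, 1]$ is the attraction probability of item $i$ and $\chi(k) \in [0, 1]$ is the examination probability of position $k$.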
no code implementations • ICML 2017 • Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvari, Zheng Wen
In this work, we propose BatchRank, the first online learning to rank algorithm for a broad class of click models.
no code implementations • ICML 2017 • Sharan Vaswani, Branislav Kveton, Zheng Wen, Mohammad Ghavamzadeh, Laks Lakshmanan, Mark Schmidt
We consider influence maximization (IM) in social networks, which is the problem of maximizing the number of users that become aware of a product by selecting a set of "seed" users to expose the product to.
no code implementations • 25 Jan 2017 • Shi Zong, Branislav Kveton, Shlomo Berkovsky, Azin Ashkan, Nikos Vlassis, Zheng Wen
To the best of our knowledge, this is the first large-scale causal study of the impact of weather on TV watching patterns.
no code implementations • 10 Aug 2016 • Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, Claire Vernade, Zheng Wen
The main challenge of the problem is that the individual values of the row and column are unobserved.
1 code implementation • NeurIPS 2017 • Zheng Wen, Branislav Kveton, Michal Valko, Sharan Vaswani
Specifically, we aim to learn the set of "best influencers" in a social network online while repeatedly interacting with it.
1 code implementation • 17 Mar 2016 • Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, Branislav Kveton
In this work, we study cascading bandits, an online learning variant of the cascade model where the goal is to recommend $K$ most attractive items from a large set of $L$ candidate items.
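A minimal CascadeUCB1-style sketch of this interaction (simplified constants and tie-breaking; not the paper's exact algorithm): recommend the $K$ items with the largest UCB indices, observe which item, if any, is clicked first, and update every examined item.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
L, K, T = 20, 4, 20000
attraction = rng.uniform(0.05, 0.35, L)              # unknown attraction probabilities

pulls = np.ones(L)                                   # one fake observation per item avoids division by zero
wins = (rng.random(L) < attraction).astype(float)

for t in range(1, T + 1):
    ucb = wins / pulls + np.sqrt(1.5 * math.log(t) / pulls)
    ranked = np.argsort(-ucb)[:K]                    # K items with the largest indices
    clicks = rng.random(K) < attraction[ranked]      # simulate the cascade
    click_pos = int(np.argmax(clicks)) if clicks.any() else K
    for pos in range(min(click_pos + 1, K)):         # items examined up to (and including) the click
        item = ranked[pos]
        pulls[item] += 1
        wins[item] += float(pos == click_pos)

print("top items by true attraction:", np.argsort(-attraction)[:K])
print("top items by empirical mean: ", np.argsort(-(wins / pulls))[:K])
```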
1 code implementation • 9 Feb 2016 • Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Zheng Wen
This work presents the first practical and regret-optimal online algorithm for learning to rank with multiple clicks in a cascade-like click model.
no code implementations • NeurIPS 2015 • Branislav Kveton, Zheng Wen, Azin Ashkan, Csaba Szepesvari
The agent observes the index of the first chosen item whose weight is zero.
no code implementations • 10 Feb 2015 • Branislav Kveton, Csaba Szepesvari, Zheng Wen, Azin Ashkan
We also prove gap-dependent upper bounds on the regret of these algorithms and derive a lower bound on the regret in cascading bandits.
no code implementations • 13 Nov 2014 • Azin Ashkan, Branislav Kveton, Shlomo Berkovsky, Zheng Wen
The need for diversification of recommendation lists arises in a number of recommender system use cases.
no code implementations • 3 Oct 2014 • Branislav Kveton, Zheng Wen, Azin Ashkan, Csaba Szepesvari
A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to constraints, and then observes stochastic weights of these items and receives their sum as a payoff.
no code implementations • 28 Jun 2014 • Zheng Wen, Branislav Kveton, Azin Ashkan
A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, and then observes stochastic weights of these items and receives their sum as a payoff.
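A minimal sketch of this interaction with a UCB-style index policy and a simple cardinality constraint standing in for the combinatorial constraints (real instances use matroids, paths, and other structures): the chosen items' stochastic weights are all revealed, which is the semi-bandit feedback.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
E, m, T = 12, 4, 10000
means = rng.uniform(0.1, 0.9, E)                    # unknown mean weights of the ground items

pulls = np.ones(E)
totals = (rng.random(E) < means).astype(float)      # one initial observation per item

for t in range(1, T + 1):
    ucb = totals / pulls + np.sqrt(1.5 * math.log(t) / pulls)
    chosen = np.argsort(-ucb)[:m]                             # feasible set maximizing the index
    weights = (rng.random(m) < means[chosen]).astype(float)   # stochastic weights of chosen items
    payoff = weights.sum()                                    # payoff is the sum of the weights
    pulls[chosen] += 1                                        # semi-bandit: every weight is observed
    totals[chosen] += weights

print("best items:", np.argsort(-means)[:m], "most played:", np.argsort(-pulls)[:m])
```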
no code implementations • 30 May 2014 • Branislav Kveton, Zheng Wen, Azin Ashkan, Michal Valko
Many important optimization problems, such as the minimum spanning tree and minimum-cost flow, can be solved optimally by a greedy method.
no code implementations • 20 Mar 2014 • Branislav Kveton, Zheng Wen, Azin Ashkan, Hoda Eydgahi, Brian Eriksson
The objective in these problems is to learn how to maximize a modular function on a matroid.
1 code implementation • 4 Feb 2014 • Ian Osband, Benjamin Van Roy, Zheng Wen
We propose randomized least-squares value iteration (RLSVI) -- a new reinforcement learning algorithm designed to explore and generalize efficiently via linearly parameterized value functions.
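A toy sketch of the randomized least-squares idea on a small tabular MDP with one-hot state-action features (illustrative only; the constants and the exact form of the perturbation differ from the paper): each episode, fit stage-wise value estimates by ridge regression on Gaussian-perturbed targets, then act greedily on the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 2, 5                                  # states, actions, horizon
P = rng.dirichlet(np.ones(S), size=(S, A))         # random transition kernel
R = rng.uniform(0, 1, (S, A))                      # mean rewards
sigma, lam = 1.0, 1.0                              # perturbation scale, ridge penalty

data = [[] for _ in range(H)]                      # (s, a, r, s') tuples per stage

def fit_randomized_q():
    """Backward pass of ridge regressions on randomly perturbed targets."""
    Q = np.zeros((H + 1, S, A))
    for h in reversed(range(H)):
        X = np.zeros((len(data[h]), S * A))
        y = np.zeros(len(data[h]))
        for i, (s, a, r, s2) in enumerate(data[h]):
            X[i, s * A + a] = 1.0                               # one-hot feature
            y[i] = r + rng.normal(0, sigma) + Q[h + 1, s2].max()  # perturbed regression target
        w = np.linalg.solve(X.T @ X + lam * np.eye(S * A),
                            X.T @ y + rng.normal(0, sigma * np.sqrt(lam), S * A))
        Q[h] = w.reshape(S, A)
    return Q

for episode in range(200):
    Q = fit_randomized_q()
    s = 0
    for h in range(H):
        a = int(np.argmax(Q[h, s]))                # act greedily on the randomized estimate
        s2 = rng.choice(S, p=P[s, a])
        r = R[s, a] + rng.normal(0, 0.1)
        data[h].append((s, a, r, s2))
        s = s2

print("greedy first-stage action from state 0:", int(np.argmax(fit_randomized_q()[0, 0])))
```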
no code implementations • 8 Jan 2014 • Zheng Wen, Daniel O'Neill, Hamid Reza Maei
Demand response (DR) for residential and small commercial buildings is estimated to account for as much as 65% of the total energy savings potential of DR, and previous work shows that a fully automated Energy Management System (EMS) is a necessary prerequisite to DR in these areas.
no code implementations • NeurIPS 2013 • Zheng Wen, Benjamin Van Roy
We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization.
no code implementations • NeurIPS 2013 • Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson, S. Muthukrishnan
Maximization of submodular functions has wide applications in machine learning and artificial intelligence.
no code implementations • 18 Jul 2013 • Zheng Wen, Benjamin Van Roy
We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization.