Preference-based Pure Exploration

4 Dec 2024 · Apurv Shukla, Debabrota Basu

We study the preference-based pure exploration problem for bandits with vector-valued rewards. The rewards are ordered using a (given) preference cone $\mathcal{C}$, and our goal is to identify the set of Pareto-optimal arms. First, to quantify the impact of preferences, we derive a novel lower bound on the sample complexity of identifying the most preferred policy with confidence level $1-\delta$. Our lower bound reveals the role played by the geometry of the preference cone and highlights how the hardness of this problem differs from that of existing best-arm identification variants. We further explicate this geometry when the rewards follow Gaussian distributions. We then provide a convex relaxation of the lower bound and leverage it to design the Preference-based Track and Stop (PreTS) algorithm, which identifies the most preferred policy. Finally, we show that the sample complexity of PreTS is asymptotically tight by deriving a new concentration inequality for vector-valued rewards.
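
To make the cone-induced ordering concrete, here is a minimal sketch of identifying the Pareto-optimal set of arms from their mean reward vectors. It assumes the preference cone is polyhedral and represented by a matrix `W` of halfspace normals, i.e. $\mathcal{C} = \{x : Wx \ge 0\}$; the paper only requires a given cone, so this representation, the function names `in_cone` and `pareto_set`, and the orthant example are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def in_cone(x, W, tol=1e-9):
    """Membership test for a polyhedral cone C = {x : W @ x >= 0}.

    W is a (k, d) matrix of halfspace normals. Representing the
    preference cone this way is an assumption for illustration.
    """
    return np.all(W @ x >= -tol)

def pareto_set(means, W, tol=1e-9):
    """Return indices of arms whose mean vectors are C-Pareto optimal.

    Arm i is dominated by arm j if mu_j - mu_i lies in C \\ {0}, i.e.
    arm j improves on arm i with respect to the cone order.
    """
    n = len(means)
    optimal = []
    for i in range(n):
        dominated = any(
            j != i
            and in_cone(means[j] - means[i], W, tol)
            and np.linalg.norm(means[j] - means[i]) > tol
            for j in range(n)
        )
        if not dominated:
            optimal.append(i)
    return optimal

# Example: C is the positive orthant in R^2, recovering the usual
# componentwise Pareto order. Arm 2 is dominated by arm 1.
W = np.eye(2)
means = np.array([[1.0, 0.2], [0.5, 0.5], [0.3, 0.3]])
print(pareto_set(means, W))  # -> [0, 1]
```

Choosing a wider cone than the positive orthant encodes stronger preferences and shrinks the Pareto set, which is one way to read the paper's point that the cone's geometry governs the hardness of the identification problem.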
