74 papers with code • 0 benchmarks • 0 datasets
Off-policy Evaluation (OPE), or offline evaluation in general, estimates the performance of hypothetical policies using only offline log data. It is particularly useful in applications where online interaction is high-stakes and expensive, such as precision medicine and recommender systems.
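As a concrete reference point, here is a minimal sketch of the basic inverse propensity score (IPS) estimator that most OPE methods build on; the rewards, propensities, and target-policy probabilities below are made up for illustration.

```python
import numpy as np

def ips_estimate(rewards, propensities, target_probs):
    """IPS estimate of a target policy's value from logged bandit data.

    rewards      : rewards observed for the logged actions
    propensities : behavior policy's probability of each logged action
    target_probs : target policy's probability of the same logged action
    """
    weights = target_probs / propensities  # importance weights
    return float(np.mean(weights * rewards))

# Toy log: four rounds, behavior policy chose each logged action w.p. 0.5.
rewards      = np.array([1.0, 0.0, 1.0, 1.0])
propensities = np.array([0.5, 0.5, 0.5, 0.5])
target_probs = np.array([0.8, 0.2, 0.8, 0.2])

print(ips_estimate(rewards, propensities, target_probs))  # → 0.9
```

The estimator reweights each logged reward by how much more (or less) likely the target policy was to take the logged action than the behavior policy was.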
Our dataset is unique in that it contains a set of multiple logged bandit datasets collected by running different policies on the same platform.
Unfortunately, when the number of actions is large, existing OPE estimators -- most of which are based on inverse propensity score weighting -- degrade severely and can suffer from extreme bias and variance.
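The variance blow-up with large action spaces can be seen in a small simulation (an illustrative setup, not taken from any specific paper): under a uniform logging policy and a deterministic target policy, the importance weight on a matching sample equals the number of actions, so the weight variance grows roughly linearly with the action-space size.

```python
import numpy as np

def ips_weights(n_actions, n_samples, rng):
    """Importance weights for a deterministic target policy (always
    plays action 0) against a uniform logging policy over n_actions."""
    actions = rng.integers(n_actions, size=n_samples)
    propensity = 1.0 / n_actions                 # uniform logging policy
    target_prob = (actions == 0).astype(float)   # deterministic target
    return target_prob / propensity

rng = np.random.default_rng(0)
for k in (2, 100, 10_000):
    w = ips_weights(k, 100_000, rng)
    print(f"{k:>6} actions: mean weight {w.mean():.2f}, variance {w.var():.1f}")
```

The mean weight stays near 1 (the weights are unbiased), while the empirical variance explodes as the number of actions grows.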
We study the off-policy evaluation problem (estimating the value of a target policy using data collected by another policy) under the contextual bandit model.
Training models that perform well under distribution shifts is a central challenge in machine learning.
Unfortunately, identifying a reliable estimator from results reported in research papers is often difficult because the current experimental procedure evaluates and compares the estimators' performance on a narrow set of hyperparameters and evaluation policies.
We show that the proposed estimator is unbiased in more cases compared to existing estimators that make stronger assumptions.
We consider a personalized pricing problem in which we have data consisting of feature information, historical pricing decisions, and binary realized demand.
We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set.
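A context-free toy (all quantities invented for illustration) shows what importance sampling with a behavior policy estimated from the same log looks like, using empirical action frequencies as the estimated propensities. In this degenerate setting, where the reward is a deterministic function of the action, the estimated-propensity version recovers the true value exactly while the known-propensity version retains sampling noise.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10_000, 5

actions = rng.integers(k, size=n)        # uniform behavior policy over k arms
rewards = (actions == 0).astype(float)   # reward 1 only for action 0

# Target policy: uniform over actions {0, 1}, so its true value is 0.5.
target_probs = np.where(actions < 2, 0.5, 0.0)

true_propensity = 1.0 / k
# Behavior policy estimated from the log via empirical action frequencies.
est_propensity = np.bincount(actions, minlength=k)[actions] / n

ips_true = np.mean(target_probs / true_propensity * rewards)
ips_est = np.mean(target_probs / est_propensity * rewards)
print(ips_true, ips_est)
```

This is only a degenerate special case of the variance-reduction phenomenon the snippet describes, but it makes the mechanism visible: estimated propensities cancel the sampling fluctuations in how often each action was logged.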