To address the above challenges, we propose a Bias Agnostic whole-page unbiased Learning to rank algorithm, BAL, to automatically discover and mitigate the biases from multiple SERP features with no specific design.
We further instantiate the proposed unbiased relevance estimation framework in Baidu search, with comprehensive practical solutions designed regarding the data pipeline for click behavior tracking and online relevance estimation with an approximated deep neural network.
Specifically, we develop a general context-aware user simulator to generate pseudo clicks for unobserved ranked lists to train rankers, which addresses the data sparsity problem.
The unbiased learning to rank (ULTR) problem has been greatly advanced by recent deep learning techniques and well-designed debias algorithms.
Beyond topical relevance, passage ranking for open-domain factoid question answering also requires a passage to contain an answer (answerability).
To understand the challenges and relevant solutions, there should be a reference for researchers and practitioners working on RL-based recommender systems.
Based on it, a more robust doubly robust (MRDR) estimator has been proposed to further reduce its variance while retaining its double robustness.
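
To make the estimator family concrete, the following is a minimal sketch of the standard doubly robust error estimate that MRDR builds on, not of MRDR itself. The function name, argument names, and the toy numbers are assumptions for illustration. Each user-item pair contributes its imputed error plus a propensity-weighted correction that is applied only when the true error was observed.

```python
def doubly_robust_error(observed, errors, imputed_errors, propensities):
    """Standard doubly robust estimate of the mean prediction error.

    observed       -- 1 if the true error of the pair was observed, else 0
    errors         -- true errors (any placeholder value where unobserved)
    imputed_errors -- errors predicted by an imputation model (hypothetical)
    propensities   -- estimated probability that each pair is observed
    """
    total = 0.0
    for o, e, e_hat, p in zip(observed, errors, imputed_errors, propensities):
        # Imputed error everywhere, corrected by the inverse-propensity-weighted
        # residual on the pairs that were actually observed.
        total += e_hat + o * (e - e_hat) / p
    return total / len(observed)


# Toy data (made up): two pairs, only the first one observed.
estimate = doubly_robust_error(
    observed=[1, 0],
    errors=[0.2, 0.0],
    imputed_errors=[0.2, 0.4],
    propensities=[0.5, 0.5],
)
```

When the imputed errors are all zero, the expression reduces to the inverse propensity scoring estimate, which is one way to see where the two sources of robustness come from.
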
However, it is nontrivial to directly apply these PLM-based rankers to a large-scale web search system due to the following challenges: (1) the prohibitively expensive computation of massive neural PLMs, especially for long web documents, prevents their deployment in an online ranking system that demands extremely low latency; (2) the discrepancy between existing ranking-agnostic pre-training objectives and ad-hoc retrieval scenarios that demand comprehensive relevance modeling is another main barrier to improving the online ranking system; (3) a real-world search engine typically involves a committee of ranking components, and thus the compatibility of the individually fine-tuned ranking model is critical for a cooperative ranking system.
This work introduces a practical, data-efficient policy learning method, named Variance-Bonus Monte Carlo Tree Search~(VB-MCTS), which can cope with very little data and facilitate learning from scratch in only a few trials.
Off-policy evaluation is a key component of reinforcement learning which evaluates a target policy with offline data collected from behavior policies.
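
As an illustration of the idea, the sketch below estimates the value of a target policy from logged data with plain importance sampling in a one-step (bandit-style) setting. This is one common baseline, chosen here only for brevity; the function name, the log format, and the toy data are assumptions.

```python
def importance_sampling_value(logged, target_policy):
    """Estimate the expected reward of target_policy from logged interactions.

    logged        -- list of (context, action, reward, behavior_prob) tuples,
                     where behavior_prob is the probability with which the
                     behavior policy chose the logged action
    target_policy -- function (context, action) -> probability under the target
    """
    total = 0.0
    for context, action, reward, behavior_prob in logged:
        # Reweight each logged reward by how much more (or less) likely the
        # target policy is to take the logged action than the behavior policy.
        weight = target_policy(context, action) / behavior_prob
        total += weight * reward
    return total / len(logged)


# Toy log (made up): a uniform behavior policy over two actions.
log = [("x", 0, 0.0, 0.5), ("x", 1, 1.0, 0.5), ("x", 1, 1.0, 0.5), ("x", 0, 0.0, 0.5)]


def always_action_one(context, action):
    # Hypothetical deterministic target policy that always picks action 1.
    return 1.0 if action == 1 else 0.0


value = importance_sampling_value(log, always_action_one)
```

The estimate is unbiased only if the behavior policy gives nonzero probability to every action the target policy can take, and its variance grows as the two policies diverge.
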
Distant supervision provides a means to create a large number of weakly labeled data at low cost for relation classification.
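
A minimal sketch of the usual labeling heuristic follows, assuming a tiny hypothetical knowledge base and naive substring matching in place of real entity linking. Any sentence that mentions both entities of a known triple is labeled with that triple's relation, which is also where the label noise comes from.

```python
def distant_label(sentences, knowledge_base):
    """Weakly label sentences with relations from a knowledge base.

    knowledge_base -- dict mapping (head_entity, tail_entity) -> relation name
    Returns a list of (sentence, head, tail, relation) tuples.
    """
    labeled = []
    for sentence in sentences:
        for (head, tail), relation in knowledge_base.items():
            # Naive substring match; a real pipeline would use entity linking.
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, tail, relation))
    return labeled


# Hypothetical knowledge base and corpus, for illustration only.
kb = {("Paris", "France"): "capital_of"}
corpus = [
    "Paris is the capital of France.",
    "Paris hosted a match against a club from France.",  # noisy match
    "Berlin is in Germany.",
]
weak_labels = distant_label(corpus, kb)
```

The second sentence receives the `capital_of` label even though it does not express that relation, which illustrates why such data is only weakly labeled.
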
Therefore, the proposed exploration policy, to balance between learning the user profile and making accurate recommendations, can be directly optimized by maximizing users' long-term satisfaction with reinforcement learning.
Specifically, we introduce a spatio-temporal graph convolutional network to capture both spatial and temporal relations from dynamic AUs, in which the AU relations are formulated as a spatio-temporal graph with adaptively learned instead of predefined edge weights.
Thus, it calls for a user simulator that can mimic real users' behaviors, with which we can pre-train and evaluate new recommendation algorithms.
Though reinforcement learning~(RL) naturally fits the problem of maximizing long-term rewards, applying RL to optimize long-term user engagement still faces challenges: user behaviors are versatile and difficult to model, and typically consist of both instant feedback~(e.g., clicks, ordering) and delayed feedback~(e.g., dwell time, revisit); in addition, effective off-policy learning is still immature, especially when combining bootstrapping and function approximation.