no code implementations • 21 Aug 2020 • Xu He, Bo An, Yanghua Li, Haikai Chen, Qingyu Guo, Xin Li, Zhirong Wang
First, since we concern the reward of a set of recommended items, we model the online recommendation as a contextual combinatorial bandit problem and define the reward of a recommended set.
no code implementations • 21 Aug 2020 • Xu He, Bo An, Yanghua Li, Haikai Chen, Rundong Wang, Xinrun Wang, Runsheng Yu, Xin Li, Zhirong Wang
Thus, the global policy of the whole page could be sub-optimal.
Multi-agent Reinforcement Learning Reinforcement Learning (RL)