Predicting future trajectories of road agents is a critical task for autonomous driving.
3 code implementations • 29 May 2019 • Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Morgane Lustman, Vince Gatto, Paul Covington, Jim McFadden, Tushar Chandra, Craig Boutilier
(i) We develop SLATEQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates.
The contributions of the paper are: (1) scaling REINFORCE to a production recommender system with an action space on the orders of millions; (2) applying off-policy correction to address data biases in learning from logged feedback collected from multiple behavior policies; (3) proposing a novel top-K off-policy correction to account for our policy recommending multiple items at a time; (4) showcasing the value of exploration.
YouTube represents one of the largest scale and most sophisticated industrial recommendation systems in existence.