no code implementations • 5 Jul 2024 • Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li
In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods.
no code implementations • 29 Nov 2020 • Jinlin Lai, Lixin Zou, Jiaxing Song
Off-policy evaluation is a key component of reinforcement learning which evaluates a target policy with offline data collected from behavior policies.
no code implementations • 13 Feb 2019 • Lixin Zou, Long Xia, Zhuoye Ding, Jiaxing Song, Weidong Liu, Dawei Yin
Though reinforcement learning~(RL) naturally fits the problem of maximizing the long term rewards, applying RL to optimize long-term user engagement is still facing challenges: user behaviors are versatile and difficult to model, which typically consists of both instant feedback~(e. g. clicks, ordering) and delayed feedback~(e. g. dwell time, revisit); in addition, performing effective off-policy learning is still immature, especially when combining bootstrapping and function approximation.