no code implementations • 16 Mar 2023 • Junqi Qian, Paul Weng, Chenmien Tan
LR4GPM alternates between two phases: (1) learning a (possibly vector-valued) reward function that is used to fit the performance metric, and (2) training a policy to optimize an approximation of this performance metric based on the learned rewards.
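
The alternation described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a stateless setting with a handful of discrete actions, a scalar (not vector) per-action reward fitted by ordinary least squares, and a softmax policy updated by a simple logit shift. The names `fit_rewards`, `improve_policy`, `alternate`, and `toy_metric` are hypothetical and exist only for this example.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def fit_rewards(trajs, metric_values, n_actions):
    # Phase 1 (assumed form): least-squares fit of one scalar reward per
    # action, so that the summed reward along a trajectory approximates
    # the observed value of the performance metric.
    counts = np.stack(
        [np.bincount(t, minlength=n_actions) for t in trajs]
    ).astype(float)
    targets = np.asarray(metric_values, dtype=float)
    rewards, *_ = np.linalg.lstsq(counts, targets, rcond=None)
    return rewards

def improve_policy(logits, rewards, step=1.0):
    # Phase 2 (assumed form): shift softmax logits toward actions with
    # higher learned reward. A real agent would run an RL algorithm here.
    return logits + step * rewards

def alternate(metric, n_actions=3, horizon=5, n_trajs=32, n_rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    logits = np.zeros(n_actions)
    trajs, values = [], []
    rewards = np.zeros(n_actions)
    for _ in range(n_rounds):
        # Collect trajectories with the current policy and score them
        # with the black-box performance metric.
        probs = softmax(logits)
        new = [rng.choice(n_actions, size=horizon, p=probs) for _ in range(n_trajs)]
        trajs += new
        values += [metric(t) for t in new]
        rewards = fit_rewards(trajs, values, n_actions)   # phase 1
        logits = improve_policy(logits, rewards)          # phase 2
    return softmax(logits), rewards

def toy_metric(traj):
    # Hypothetical metric: how often action 2 was chosen in the trajectory.
    return float(np.sum(traj == 2))

policy, rewards = alternate(toy_metric)
print("learned rewards:", np.round(rewards, 3))
print("final policy:", np.round(policy, 3))
```

In this toy case the metric is exactly a sum of per-action rewards, so the least-squares fit recovers it and the policy concentrates on the action the metric favors. The metrics targeted by the paper are more general, which is why the abstract speaks of optimizing an approximation of the metric.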