Lagrangian Method for Episodic Learning

29 Sep 2021 · Huang Bojun

This paper considers the problem of learning optimal value functions for finite-time decision tasks via saddle-point optimization of a nonlinear Lagrangian function derived from the $Q$-form Bellman optimality equation. Despite a long history of research on this topic, previous work on this general approach has focused almost exclusively on a linear special case known as the linear-programming approach to RL/MDPs. Our paper brings new perspectives to this general approach in the following respects: 1) Inspired by the commonly used linear $V$-form Lagrangian, we propose a nonlinear $Q$-form Lagrangian function and prove that it enjoys the strong duality property in spite of its nonlinearity. This duality property immediately leads to a new imitation learning algorithm, which we apply to machine translation, obtaining favorable performance on a standard MT benchmark. 2) We point out a fundamental limitation of existing works, which seek minimax-type saddle points of the Lagrangian function, and prove that another class of saddle points, the maximin-type ones, have better optimality properties. 3) In contrast to most previous works, our theory and algorithm are oriented to the undiscounted episode-wise reward, which is practically more relevant than the commonly studied discounted-MDP setting, thereby filling a gap between theory and practice on this topic.
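For orientation, here is a minimal sketch of the objects the abstract refers to, in the finite-horizon undiscounted setting; the notation ($r$ for the reward, $P$ for the transition kernel, $\mu$ for the initial-state distribution, $H$ for the horizon, $\lambda_t(s,a)$ for nonnegative multipliers) is ours, and the exact Lagrangian studied in the paper may differ. The $Q$-form Bellman optimality equation is

$$ Q^*_t(s,a) \;=\; r(s,a) + \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\!\Big[\max_{a'} Q^*_{t+1}(s',a')\Big], \qquad Q^*_H \equiv 0, $$

and, by direct analogy with the classical $V$-form linear-programming Lagrangian, a $Q$-form Lagrangian over value estimates $Q$ and multipliers $\lambda \ge 0$ could take the shape

$$ \mathcal{L}(Q,\lambda) \;=\; \mathbb{E}_{s_0 \sim \mu}\!\Big[\max_{a} Q_0(s_0,a)\Big] \;+\; \sum_{t,s,a} \lambda_t(s,a)\,\Big( r(s,a) + \mathbb{E}_{s'}\!\big[\max_{a'} Q_{t+1}(s',a')\big] - Q_t(s,a) \Big). $$

The inner $\max$ operators are what make such a Lagrangian nonlinear in $Q$, and minimax- versus maximin-type saddle points of such a function are the two solution concepts the abstract contrasts.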
