On-Policy TD Control

TD Lambda

$\text{TD}(\lambda)$ is a generalisation of $\text{TD}(n)$ reinforcement learning algorithms, but it employs an eligibility trace with trace-decay parameter $\lambda$ and $\lambda$-weighted returns. The eligibility trace vector is initialized to zero at the beginning of the episode, is incremented on each time step by the value gradient, and then fades away by $\gamma\lambda$:

$$ \mathbf{z}_{-1} = \mathbf{0} $$ $$ \mathbf{z}_{t} = \gamma\lambda\mathbf{z}_{t-1} + \nabla\hat{v}\left(S_{t}, \mathbf{w}_{t}\right), \quad 0 \leq t \leq T $$

The eligibility trace keeps track of which components of the weight vector have contributed to recent state valuations. In the linear case, $\nabla\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)$ is simply the feature vector $\mathbf{x}(S_{t})$.

The TD error for state-value prediction is:

$$ \delta_{t} = R_{t+1} + \gamma\hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right) - \hat{v}\left(S_{t}, \mathbf{w}_{t}\right) $$

In $\text{TD}(\lambda)$, the weight vector is updated on each step in proportion to the scalar TD error and the vector eligibility trace:

$$ \mathbf{w}_{t+1} = \mathbf{w}_{t} + \alpha\delta_{t}\mathbf{z}_{t} $$

Source: Sutton and Barto, Reinforcement Learning: An Introduction, 2nd Edition
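
Below is a minimal sketch of semi-gradient $\text{TD}(\lambda)$ for state-value prediction under linear function approximation, following the three equations above. The Gymnasium-style environment interface, the `features` function returning $\mathbf{x}(S_t)$, and the hyperparameter values are illustrative assumptions, not part of the source description.

```python
import numpy as np

def semi_gradient_td_lambda(env, features, num_features,
                            alpha=0.01, gamma=0.99, lam=0.9,
                            num_episodes=500):
    """Semi-gradient TD(lambda) prediction with linear value approximation.

    Assumptions: `env` follows a Gymnasium-style reset()/step() interface,
    and `features(state)` returns the feature vector x(S_t), which for the
    linear case equals the gradient of v_hat(S_t, w).
    """
    w = np.zeros(num_features)                  # weight vector w
    for _ in range(num_episodes):
        state, _ = env.reset()
        z = np.zeros(num_features)              # eligibility trace z_{-1} = 0
        done = False
        while not done:
            # Prediction setting: evaluate a fixed (here random) policy.
            action = env.action_space.sample()
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            x = features(state)                 # x(S_t) = grad v_hat(S_t, w)
            z = gamma * lam * z + x             # decay trace, add current gradient

            v = w @ x                           # v_hat(S_t, w)
            v_next = 0.0 if terminated else w @ features(next_state)
            delta = reward + gamma * v_next - v # TD error delta_t

            w += alpha * delta * z              # w_{t+1} = w_t + alpha * delta_t * z_t
            state = next_state
    return w
```

With $\lambda = 0$ the trace reduces to the current gradient and the update matches one-step semi-gradient TD(0); as $\lambda \to 1$ the update approaches a Monte Carlo-style return.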
