Policy Gradient Methods

A3C, Asynchronous Advantage Actor-Critic, is a policy gradient algorithm in reinforcement learning that maintains a policy $\pi\left(a_{t}\mid s_{t}; \theta\right)$ and an estimate of the value function $V\left(s_{t}; \theta_{v}\right)$. It operates in the forward view and uses a mix of $n$-step returns to update both the policy and the value function. The policy and the value function are updated after every $t_{\text{max}}$ actions or when a terminal state is reached. The update performed by the algorithm can be seen as $\nabla_{\theta'}\log\pi\left(a_{t}\mid s_{t}; \theta'\right)A\left(s_{t}, a_{t}; \theta, \theta_{v}\right)$, where $A\left(s_{t}, a_{t}; \theta, \theta_{v}\right)$ is an estimate of the advantage function given by:

$$\sum^{k-1}_{i=0}\gamma^{i}r_{t+i} + \gamma^{k}V\left(s_{t+k}; \theta_{v}\right) - V\left(s_{t}; \theta_{v}\right)$$

where $k$ can vary from state to state and is upper-bounded by $t_{\text{max}}$.
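
As a concrete illustration, here is a minimal sketch (not code from the paper) of how these $k$-step advantage estimates can be computed for one finished rollout segment; the function name `advantage_estimates`, the list-based inputs, and the default discount $\gamma = 0.99$ are assumptions made for the example.

```python
import numpy as np

def advantage_estimates(rewards, values, bootstrap_value, gamma=0.99):
    """k-step advantage estimates for one rollout segment.

    rewards[t] is r_t, values[t] is V(s_t; theta_v), and bootstrap_value
    is V(s_{t+k}; theta_v) for the state following the segment (0 if that
    state is terminal). k shrinks toward the end of the segment and is
    upper-bounded by t_max = len(rewards).
    """
    returns = np.empty(len(rewards))
    R = bootstrap_value
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R       # sum_i gamma^i r_{t+i} + gamma^k V(s_{t+k})
        returns[t] = R
    return returns - np.asarray(values)  # subtract the baseline V(s_t; theta_v)

# Example: a 3-step segment bootstrapped with V(s_{t+3}; theta_v) = 0.2
adv = advantage_estimates(rewards=[1.0, 0.0, 1.0],
                          values=[0.5, 0.4, 0.3],
                          bootstrap_value=0.2)
```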

In A3C, the critic learns the value function while multiple actors are trained in parallel, each periodically synchronizing its copy of the parameters with the global ones. Each actor accumulates gradients over its local rollout before applying them to the shared parameters, which stabilizes training and resembles a parallelized form of stochastic gradient descent.
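
Below is a schematic sketch of one such asynchronous update for a single worker, assuming PyTorch, a hypothetical `env` with a classic Gym-style `step()` that returns batched tensor observations, an actor-critic model like the `ActorCritic` module sketched after the next paragraph, and an optimizer constructed over the global model's parameters; the multiprocessing machinery and the entropy regularization used in the paper are omitted.

```python
import torch

def worker_update(global_model, local_model, optimizer, env, state,
                  t_max=5, gamma=0.99):
    # Synchronize the worker's local copy with the global parameters.
    local_model.load_state_dict(global_model.state_dict())

    log_probs, values, rewards, done = [], [], [], False
    for _ in range(t_max):                       # act for at most t_max steps
        policy, value = local_model(state)       # pi(.|s; theta), V(s; theta_v)
        action = torch.distributions.Categorical(policy).sample()
        state, reward, done, _ = env.step(action.item())
        log_probs.append(torch.log(policy[0, action]))
        values.append(value)
        rewards.append(reward)
        if done:
            break

    # Bootstrap with V(s_{t+k}; theta_v) unless the episode terminated.
    R = torch.zeros(1) if done else local_model(state)[1].detach()
    policy_loss, value_loss = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        advantage = R - values[t]
        policy_loss = policy_loss - log_probs[t] * advantage.detach()
        value_loss = value_loss + advantage.pow(2)

    # Accumulate gradients on the local copy, then apply them to the global
    # parameters (the asynchronous step). The 0.5 value-loss weight is an
    # illustrative choice, not a value prescribed by the paper.
    local_model.zero_grad()
    (policy_loss + 0.5 * value_loss).sum().backward()
    for local_p, global_p in zip(local_model.parameters(),
                                 global_model.parameters()):
        global_p.grad = local_p.grad
    optimizer.step()                             # optimizer holds the global params
    return state, done
```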

Note that while the parameters $\theta$ of the policy and $\theta_{v}$ of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one softmax output for the policy $\pi\left(a_{t}\mid{s}_{t}; \theta\right)$ and one linear output for the value function $V\left(s_{t}; \theta_{v}\right)$, with all non-output layers shared.
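
A minimal PyTorch sketch of such a shared network is shown below; the layer sizes follow the small Atari architecture referenced in the paper, but the single-frame $84\times84$ input and the class and attribute names are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared trunk with a softmax policy head and a linear value head."""

    def __init__(self, num_actions):
        super().__init__()
        # All non-output layers are shared between pi(a|s; theta) and V(s; theta_v).
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
        )
        self.policy_head = nn.Linear(256, num_actions)  # softmax output for the policy
        self.value_head = nn.Linear(256, 1)             # linear output for the value function

    def forward(self, x):
        h = self.trunk(x)                 # x: (batch, 1, 84, 84)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h)
```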

Source: Asynchronous Methods for Deep Reinforcement Learning
