
# SRU

Introduced by Lei et al. in "Simple Recurrent Units for Highly Parallelizable Recurrence"

SRU, or Simple Recurrent Unit, is a recurrent neural unit with a light form of recurrence. SRU exhibits the same level of parallelism as convolution and feed-forward nets. This is achieved by balancing sequential dependence and independence: while the state computation of SRU is time-dependent, each state dimension is independent. This simplification enables CUDA-level optimizations that parallelize the computation across hidden dimensions and time steps, effectively using the full capacity of modern GPUs.

SRU also replaces the use of convolutions (i.e., n-gram filters), as in QRNN and KNN, with more recurrent connections. This retains modeling capacity while using less computation (and fewer hyper-parameters). Additionally, SRU improves the training of deep recurrent models by employing highway connections and a parameter initialization scheme tailored for gradient propagation in deep architectures.

A single layer of SRU involves the following computation:

$$\mathbf{f}_{t} =\sigma\left(\mathbf{W}_{f} \mathbf{x}_{t}+\mathbf{v}_{f} \odot \mathbf{c}_{t-1}+\mathbf{b}_{f}\right)$$

$$\mathbf{c}_{t} =\mathbf{f}_{t} \odot \mathbf{c}_{t-1}+\left(1-\mathbf{f}_{t}\right) \odot\left(\mathbf{W} \mathbf{x}_{t}\right)$$

$$\mathbf{r}_{t} =\sigma\left(\mathbf{W}_{r} \mathbf{x}_{t}+\mathbf{v}_{r} \odot \mathbf{c}_{t-1}+\mathbf{b}_{r}\right)$$

$$\mathbf{h}_{t} =\mathbf{r}_{t} \odot \mathbf{c}_{t}+\left(1-\mathbf{r}_{t}\right) \odot \mathbf{x}_{t}$$

where $\mathbf{W}, \mathbf{W}_{f}$ and $\mathbf{W}_{r}$ are parameter matrices and $\mathbf{v}_{f}, \mathbf{v}_{r}, \mathbf{b}_{f}$ and $\mathbf{b}_{r}$ are parameter vectors to be learnt during training. The complete architecture decomposes into two sub-components: a light recurrence and a highway network.
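As a rough illustration, the four equations can be written as a step-by-step NumPy loop. This is a minimal sketch, not the authors' CUDA implementation: the function name `sru_layer`, the argument layout, and the zero initial state are assumptions made for the example. Because the last equation adds $\mathbf{x}_{t}$ directly to the output, the sketch also assumes that the input and hidden dimensions are equal.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sru_layer(x, W, W_f, W_r, v_f, v_r, b_f, b_r, c0=None):
    """Hypothetical reference forward pass for one SRU layer.

    x: (T, d) input sequence; W, W_f, W_r: (d, d) matrices;
    v_f, v_r, b_f, b_r: (d,) vectors. Returns (h, c), each of shape (T, d).
    """
    T, d = x.shape
    # Assumption: the state starts at zero unless c0 is given.
    c_prev = np.zeros(d) if c0 is None else c0
    h = np.empty((T, d))
    c = np.empty((T, d))
    for t in range(T):
        # Forget gate f_t
        f_t = sigmoid(W_f @ x[t] + v_f * c_prev + b_f)
        # State c_t: gated average of the previous state and W x_t
        c_t = f_t * c_prev + (1.0 - f_t) * (W @ x[t])
        # Reset gate r_t (uses c_{t-1}, as in the equation above)
        r_t = sigmoid(W_r @ x[t] + v_r * c_prev + b_r)
        # Output h_t: gated mix of the state and the raw input
        h[t] = r_t * c_t + (1.0 - r_t) * x[t]
        c[t] = c_t
        c_prev = c_t
    return h, c
```

All products with $\mathbf{v}_{f}$ and $\mathbf{v}_{r}$ are element-wise (`*`), so the previous state never passes through a matrix multiplication.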

The light recurrence component successively reads the input vectors $\mathbf{x}_{t}$ and computes the sequence of states $\mathbf{c}_{t}$ capturing sequential information. The computation resembles other recurrent networks such as LSTM, GRU and RAN. Specifically, a forget gate $\mathbf{f}_{t}$ controls the information flow and the state vector $\mathbf{c}_{t}$ is determined by adaptively averaging the previous state $\mathbf{c}_{t-1}$ and the current observation $\mathbf{W} \mathbf{x}_{t}$ according to $\mathbf{f}_{t}$.
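The light recurrence can also be sketched on its own. In the NumPy sketch below, the projections $\mathbf{W} \mathbf{x}_{t}$ and $\mathbf{W}_{f} \mathbf{x}_{t}$ are computed for all time steps in a single matrix product, since they do not depend on the state, and only the element-wise update runs sequentially. The function name `light_recurrence`, the row-per-time-step layout, and the zero initial state are assumptions for illustration.

```python
import numpy as np


def light_recurrence(x, W, W_f, v_f, b_f):
    """Hypothetical sketch of the SRU light recurrence.

    x: (T, d_in); W, W_f: (d, d_in); v_f, b_f: (d,). Returns c of shape (T, d).
    """
    # State-independent projections for every time step at once.
    u = x @ W.T            # row t holds W x_t
    g = x @ W_f.T + b_f    # row t holds W_f x_t + b_f
    c = np.empty_like(u)
    c_prev = np.zeros(u.shape[1])  # assumption: zero initial state
    for t in range(u.shape[0]):
        # Forget gate: only element-wise operations touch the state.
        f_t = 1.0 / (1.0 + np.exp(-(g[t] + v_f * c_prev)))
        # Adaptive average of the previous state and the observation.
        c_prev = f_t * c_prev + (1.0 - f_t) * u[t]
        c[t] = c_prev
    return c
```

Inside the loop, coordinate `j` of the state reads only coordinate `j` of `u`, `g`, `v_f` and the previous state, so each state dimension can be updated independently of the others.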

#### Papers


| Task | Papers | Share |
| --- | --- | --- |
| Speech Recognition | 4 | 18.18% |
| Language Modelling | 2 | 9.09% |
| General Classification | 2 | 9.09% |
| Quantization | 1 | 4.55% |