
# Quasi-Recurrent Neural Network

Introduced by Bradbury et al. in Quasi-Recurrent Neural Networks

A QRNN, or Quasi-Recurrent Neural Network, is a type of recurrent neural network that alternates convolutional layers, which apply in parallel across timesteps, with a minimalist recurrent pooling function that applies in parallel across channels. Because of this increased parallelism, QRNNs can be up to 16 times faster than LSTMs at training and test time.

Given an input sequence $\mathbf{X} \in \mathbb{R}^{T \times n}$ of $T$ $n$-dimensional vectors $\mathbf{x}_{1}, \dots, \mathbf{x}_{T}$, the convolutional subcomponent of a QRNN performs convolutions in the timestep dimension with a bank of $m$ filters, producing a sequence $\mathbf{Z} \in \mathbb{R}^{T \times m}$ of $m$-dimensional candidate vectors $\mathbf{z}_{t}$. Masked convolutions are used so that filters cannot access information from future timesteps (implemented with left padding).
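The masking by left padding described above can be sketched in a few lines. This is a minimal illustration using plain Python lists, not the authors' implementation. The function name `masked_conv1d` and the layout of the filter bank `W` (a list of $k$ taps, each an $n \times m$ matrix, with the last tap applied to the current timestep) are assumptions made for the example.

```python
def masked_conv1d(X, W):
    """Causal convolution along the timestep dimension.

    X: list of T input vectors, each of length n.
    W: hypothetical filter-bank layout, a list of k taps, each an
       n x m matrix given as nested lists.
    Returns T output vectors of length m; output t depends only on
    x_{t-k+1}, ..., x_t.
    """
    k = len(W)
    n = len(W[0])
    m = len(W[0][0])
    # Left-pad with k-1 zero vectors so early outputs see zeros in place
    # of the missing past and never read values from the future.
    padded = [[0.0] * n for _ in range(k - 1)] + [list(x) for x in X]
    out = []
    for t in range(len(X)):
        window = padded[t:t + k]  # corresponds to x_{t-k+1} .. x_t
        y = [0.0] * m
        for j in range(k):
            for a in range(n):
                for b in range(m):
                    y[b] += window[j][a] * W[j][a][b]
        out.append(y)
    return out


# Toy case: k=2, n=1, m=1, both taps equal to 1, so each output is the
# sum of the current and the previous input.
X = [[1.0], [2.0], [3.0], [4.0]]
W = [[[1.0]], [[1.0]]]
Y = masked_conv1d(X, W)  # [[1.0], [3.0], [5.0], [7.0]]
```

Changing the last input in `X` changes only the last output, which is the property the mask is meant to guarantee.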

Additional convolutions are applied with separate filter banks to obtain sequences of vectors for the elementwise gates that are needed for the pooling function. While the candidate vectors are passed through a $\tanh$ nonlinearity, the gates use an elementwise sigmoid. If the pooling function requires a forget gate $\mathbf{f}_{t}$ and an output gate $\mathbf{o}_{t}$ at each timestep, the full set of computations in the convolutional component is then:

$$\mathbf{Z} = \tanh\left(\mathbf{W}_{z} * \mathbf{X}\right)$$

$$\mathbf{F} = \sigma\left(\mathbf{W}_{f} * \mathbf{X}\right)$$

$$\mathbf{O} = \sigma\left(\mathbf{W}_{o} * \mathbf{X}\right)$$

where $\mathbf{W}_{z}$, $\mathbf{W}_{f}$, and $\mathbf{W}_{o}$, each in $\mathbb{R}^{k \times n \times m}$, are the convolutional filter banks of filter width $k$, and $*$ denotes a masked convolution along the timestep dimension. The simplest pooling option is the dynamic average pooling of Balduzzi & Ghifary (2016), which uses only a forget gate:

$$\mathbf{h}_{t} = \mathbf{f}_{t} \odot \mathbf{h}_{t-1} + \left(1 - \mathbf{f}_{t}\right) \odot \mathbf{z}_{t}$$

This is denoted f-pooling. The function may also include an output gate:

$$\mathbf{c}_{t} = \mathbf{f}_{t} \odot \mathbf{c}_{t-1} + \left(1 - \mathbf{f}_{t}\right) \odot \mathbf{z}_{t}$$

$$\mathbf{h}_{t} = \mathbf{o}_{t} \odot{\mathbf{c}_{t}}$$

This is denoted fo-pooling. Alternatively, the recurrence relation may include independent input and forget gates:

$$\mathbf{c}_{t} = \mathbf{f}_{t} \odot \mathbf{c}_{t-1} + \mathbf{i}_{t} \odot \mathbf{z}_{t}$$

$$\mathbf{h}_{t} = \mathbf{o}_{t} \odot{\mathbf{c}_{t}}$$

This is denoted ifo-pooling. In each case $\mathbf{h}$ or $\mathbf{c}$ is initialized to zero. The recurrent parts of these functions must be calculated sequentially for each timestep in the sequence, but because they are elementwise and parallel along feature dimensions, evaluating them even over long sequences requires a negligible amount of computation time in practice.
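The three pooling variants differ only in which gates enter the recurrence, so they can be sketched with one small function. The following is an illustrative sketch on plain Python lists. The name `qrnn_pool` and its argument layout are assumptions made for the example, and the gate values are supplied by hand rather than computed by convolutions.

```python
def qrnn_pool(Z, F, O=None, I=None):
    """Run the elementwise pooling recurrence over a sequence.

    Z, F, O, I are lists of T vectors of length m (candidates and gates).
      F only      -> f-pooling
      F and O     -> fo-pooling
      F, O and I  -> ifo-pooling
    Returns the list of T hidden states h_t.
    """
    m = len(Z[0])
    state = [0.0] * m  # h_0 (or c_0) is initialized to zero
    H = []
    for t in range(len(Z)):
        # Weight on the candidate: the independent input gate i_t if
        # given, otherwise 1 - f_t.
        in_gate = I[t] if I is not None else [1.0 - f for f in F[t]]
        state = [F[t][j] * state[j] + in_gate[j] * Z[t][j] for j in range(m)]
        if O is None:
            H.append(state)  # f-pooling: the state itself is h_t
        else:
            H.append([O[t][j] * state[j] for j in range(m)])  # h_t = o_t * c_t
    return H


# One channel, three timesteps, constant candidate 1 and forget gate 0.5.
Z = [[1.0], [1.0], [1.0]]
F = [[0.5], [0.5], [0.5]]
h_f = qrnn_pool(Z, F)                                # [[0.5], [0.75], [0.875]]
h_fo = qrnn_pool(Z, F, O=[[0.5]] * 3)                # [[0.25], [0.375], [0.4375]]
h_ifo = qrnn_pool(Z, F, O=[[1.0]] * 3, I=[[1.0]] * 3)  # [[1.0], [1.5], [1.75]]
```

The loop over `t` is the only sequential part. Every operation inside it is elementwise over the $m$ channels, which is what lets an optimized implementation process all channels in parallel.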

A single QRNN layer thus performs an input-dependent pooling, followed by a gated linear combination of convolutional features. As with convolutional neural networks, two or more QRNN layers should be stacked to create a model with the capacity to approximate more complex functions.
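To make the layer structure concrete, here is a small sketch of one QRNN layer with fo-pooling, stacked twice. It uses plain Python lists for readability, whereas a practical implementation would use an array library. The helper names (`masked_conv`, `qrnn_layer`, `random_filters`), the filter layout, the random initialization range, and the toy sizes are all assumptions for illustration and are not taken from the paper.

```python
import math
import random


def masked_conv(X, W):
    """Causal convolution: X is T x n, W is k x n x m, result is T x m."""
    k, n, m = len(W), len(W[0]), len(W[0][0])
    padded = [[0.0] * n] * (k - 1) + X  # left padding hides the future
    return [[sum(padded[t + j][a] * W[j][a][b]
                 for j in range(k) for a in range(n))
             for b in range(m)]
            for t in range(len(X))]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def qrnn_layer(X, Wz, Wf, Wo):
    """One QRNN layer with fo-pooling; returns T hidden vectors of size m."""
    Z = [[math.tanh(v) for v in row] for row in masked_conv(X, Wz)]
    F = [[sigmoid(v) for v in row] for row in masked_conv(X, Wf)]
    O = [[sigmoid(v) for v in row] for row in masked_conv(X, Wo)]
    c = [0.0] * len(Z[0])  # c_0 = 0
    H = []
    for z, f, o in zip(Z, F, O):
        # c_t = f_t * c_{t-1} + (1 - f_t) * z_t, elementwise
        c = [fj * cj + (1.0 - fj) * zj for fj, cj, zj in zip(f, c, z)]
        # h_t = o_t * c_t, elementwise
        H.append([oj * cj for oj, cj in zip(o, c)])
    return H


def random_filters(rng, k, n, m):
    """Hypothetical initializer: uniform weights in [-0.5, 0.5]."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(m)]
             for _ in range(n)]
            for _ in range(k)]


rng = random.Random(0)
T, n, m, k = 6, 3, 4, 2  # toy sizes chosen for the example
X = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(T)]

# Each layer has three filter banks: candidates, forget gate, output gate.
layer1 = [random_filters(rng, k, n, m) for _ in range(3)]
layer2 = [random_filters(rng, k, m, m) for _ in range(3)]

H1 = qrnn_layer(X, *layer1)   # T x m
H2 = qrnn_layer(H1, *layer2)  # second layer consumes the first layer's output
```

Because both the convolution and the pooling look only backward in time, the stacked output at timestep $t$ still depends only on inputs up to $t$.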
