A QRNN, or Quasi-Recurrent Neural Network, is a type of recurrent neural network that alternates convolutional layers, which apply in parallel across timesteps, with a minimalist recurrent pooling function that applies in parallel across channels. Due to their increased parallelism, QRNNs can be up to 16 times faster than LSTMs at train and test time.
Given an input sequence $\mathbf{X} \in \mathbb{R}^{T\times{n}}$ of $T$ $n$-dimensional vectors $\mathbf{x}_{1}, \dots, \mathbf{x}_{T}$, the convolutional subcomponent of a QRNN performs convolutions in the timestep dimension with a bank of $m$ filters, producing a sequence $\mathbf{Z} \in \mathbb{R}^{T\times{m}}$ of $m$-dimensional candidate vectors $\mathbf{z}_{t}$. Masked convolutions are used so that filters cannot access information from future timesteps (implemented by left-padding the input).
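The masked convolution can be sketched as follows. This is a minimal NumPy illustration of the left-padding idea, not the paper's optimized implementation; the function name and loop-based evaluation are choices made here for clarity:

```python
import numpy as np

def masked_conv1d(X, W):
    """Masked (causal) 1-D convolution over the timestep dimension.

    X: (T, n) input sequence; W: (k, n, m) filter bank.
    Left-padding with k-1 zero timesteps ensures that output t depends
    only on inputs x_{t-k+1}, ..., x_t, never on future timesteps.
    Returns an array of shape (T, m).
    """
    k, n, m = W.shape
    T = X.shape[0]
    Xp = np.vstack([np.zeros((k - 1, n)), X])  # left padding
    out = np.zeros((T, m))
    for t in range(T):
        window = Xp[t:t + k]                   # covers timesteps t-k+1 .. t
        out[t] = np.einsum('kn,knm->m', window, W)
    return out
```

Changing a future input leaves all earlier outputs untouched, which is exactly the masking property the text describes.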
Additional convolutions are applied with separate filter banks to obtain sequences of vectors for the elementwise gates that are needed for the pooling function. While the candidate vectors are passed through a $\tanh$ nonlinearity, the gates use an elementwise sigmoid. If the pooling function requires a forget gate $f_{t}$ and an output gate $o_{t}$ at each timestep, the full set of computations in the convolutional component is then:
$$ \mathbf{Z} = \tanh\left(\mathbf{W}_{z} ∗ \mathbf{X}\right) $$ $$ \mathbf{F} = \sigma\left(\mathbf{W}_{f} ∗ \mathbf{X}\right) $$ $$ \mathbf{O} = \sigma\left(\mathbf{W}_{o} ∗ \mathbf{X}\right) $$
where $\mathbf{W}_{z}$, $\mathbf{W}_{f}$, and $\mathbf{W}_{o}$, each in $\mathbb{R}^{k\times{n}\times{m}}$, are the convolutional filter banks and ∗ denotes a masked convolution along the timestep dimension. The simplest pooling option is dynamic average pooling (Balduzzi & Ghifary, 2016), which uses only a forget gate:
$$ \mathbf{h}_{t} = \mathbf{f}_{t} \odot \mathbf{h}_{t-1} + \left(1 - \mathbf{f}_{t}\right) \odot \mathbf{z}_{t} $$
This variant is denoted f-pooling. The pooling function may also include an output gate:
$$ \mathbf{c}_{t} = \mathbf{f}_{t} \odot \mathbf{c}_{t-1} + \left(1 - \mathbf{f}_{t}\right) \odot \mathbf{z}_{t} $$
$$ \mathbf{h}_{t} = \mathbf{o}_{t} \odot{\mathbf{c}_{t}} $$
This variant is denoted fo-pooling. Alternatively, the recurrence relation may include an input gate that is independent of the forget gate:
$$ \mathbf{c}_{t} = \mathbf{f}_{t} \odot \mathbf{c}_{t-1} + \mathbf{i}_{t} \odot \mathbf{z}_{t} $$
$$ \mathbf{h}_{t} = \mathbf{o}_{t} \odot{\mathbf{c}_{t}} $$
This variant is denoted ifo-pooling. In each case $h$ or $c$ is initialized to zero. The recurrent parts of these functions must be computed sequentially for each timestep, but their parallelism along the feature dimension means that evaluating them, even over long sequences, requires a negligible amount of computation time.
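The three pooling variants can be sketched as a single NumPy routine. This is a minimal illustration under the assumption that the gate sequences have already been produced by the convolutional component; the function name and the optional-argument interface are choices made here, not part of the paper:

```python
import numpy as np

def qrnn_pool(Z, F, O=None, I=None):
    """f-, fo-, and ifo-pooling over the timestep dimension.

    Z: (T, m) tanh candidate vectors; F: (T, m) sigmoid forget gates.
    O: optional (T, m) output gates (fo-/ifo-pooling).
    I: optional (T, m) independent input gates (ifo-pooling);
       when omitted, the input gate is tied to 1 - f_t.
    The loop is sequential over timesteps but elementwise (and hence
    parallel) over the m channels; the state is initialized to zero.
    """
    T, m = Z.shape
    if I is None:
        I = 1.0 - F                          # f-/fo-pooling tie i_t = 1 - f_t
    H = np.empty((T, m))
    c = np.zeros(m)                          # c_0 (or h_0) initialized to zero
    for t in range(T):
        c = F[t] * c + I[t] * Z[t]           # cell-state recurrence
        H[t] = c if O is None else O[t] * c  # apply output gate if present
    return H
```

With `O=None, I=None` this is f-pooling; supplying `O` gives fo-pooling, and supplying both `O` and `I` gives ifo-pooling.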
A single QRNN layer thus performs an input-dependent pooling, followed by a gated linear combination of convolutional features. As with convolutional neural networks, two or more QRNN layers should be stacked to create a model with the capacity to approximate more complex functions.
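Putting the two components together, a full layer and a two-layer stack can be sketched as below. This is a compact NumPy sketch of one QRNN layer with f-pooling, assuming randomly initialized filter banks; it is illustrative only, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qrnn_layer(X, Wz, Wf):
    """One QRNN layer: masked convolution, then f-pooling.

    X: (T, n) input sequence; Wz, Wf: (k, n, m) filter banks for the
    candidate vectors and forget gates, respectively.
    """
    k, n, m = Wz.shape
    T = X.shape[0]
    Xp = np.vstack([np.zeros((k - 1, n)), X])      # left padding (masked conv)
    Z = np.empty((T, m))
    F = np.empty((T, m))
    for t in range(T):
        window = Xp[t:t + k]
        Z[t] = np.tanh(np.einsum('kn,knm->m', window, Wz))
        F[t] = sigmoid(np.einsum('kn,knm->m', window, Wf))
    # f-pooling: sequential over timesteps, elementwise over channels
    H = np.empty((T, m))
    h = np.zeros(m)
    for t in range(T):
        h = F[t] * h + (1.0 - F[t]) * Z[t]
        H[t] = h
    return H

# Stacking two layers: the first layer's hidden states become the
# second layer's input sequence.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 8))
W1z, W1f = rng.standard_normal((2, 8, 16)), rng.standard_normal((2, 8, 16))
W2z, W2f = rng.standard_normal((2, 16, 16)), rng.standard_normal((2, 16, 16))
H1 = qrnn_layer(X, W1z, W1f)
H2 = qrnn_layer(H1, W2z, W2f)
```

Because each hidden state is a convex combination of tanh candidates starting from a zero initial state, the outputs stay bounded in (-1, 1).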
Source: Quasi-Recurrent Neural Networks

| Task | Papers | Share |
|---|---|---|
| Language Modelling | 2 | 33.33% |
| Sentiment Analysis | 2 | 33.33% |
| General Classification | 1 | 16.67% |
| Machine Translation | 1 | 16.67% |
| Component | Type |
|---|---|
| Convolution | Convolutions |
| Masked Convolution | Convolutions |
| Sigmoid Activation | Activation Functions |
| Tanh Activation | Activation Functions |