Recent work has shown that recurrent neural networks (RNNs) can implicitly
capture and exploit hierarchical information when trained to solve common
natural language processing tasks such as language modeling (Linzen et al.,
2016) and neural machine translation (Shi et al., 2016). In contrast, the
ability of non-recurrent neural networks to model structured data has
received little attention, despite their success in many NLP tasks (Gehring et
al., 2017; Vaswani et al., 2017).
In this work, we compare the two
architectures, recurrent and non-recurrent, with respect to their ability
to model hierarchical structure, and we find that recurrence is indeed
important for this purpose.