TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Music Modeling	Nottingham	R-Transformer	NLL	2.37	# 1
Music Modeling	Nottingham	Transformer	NLL	3.34	# 6
Language Modelling	Penn Treebank (Character Level)	R-Transformer	Bit per Character (BPC)	1.24	# 15
Language Modelling	Penn Treebank (Word Level)	R-Transformer	Test perplexity	84.38	# 40
Sequential Image Classification	Sequential MNIST	R-Transformer	Unpermuted Accuracy	99.1%	# 14
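
The language-modelling rows report two different metrics, bits per character and word-level perplexity. The short sketch below shows how they relate to cross-entropy, assuming the standard definitions (perplexity is 2 raised to the cross-entropy in bits, or e raised to the cross-entropy in nats). The helper names are illustrative and do not come from the paper or the benchmark site.

```python
import math

# Assumption: standard definitions, perplexity = 2 ** (bits per token)
# and perplexity = exp(nats per token). Helper names are hypothetical.

def bpc_to_perplexity(bpc):
    """Per-character perplexity implied by a bits-per-character score."""
    return 2.0 ** bpc

def perplexity_to_nats(ppl):
    """Cross-entropy in nats per token implied by a perplexity."""
    return math.log(ppl)

def perplexity_to_bits(ppl):
    """Cross-entropy in bits per token implied by a perplexity."""
    return math.log2(ppl)

# Values taken from the table above.
char_ppl = bpc_to_perplexity(1.24)      # Penn Treebank, character level
word_nats = perplexity_to_nats(84.38)   # Penn Treebank, word level
word_bits = perplexity_to_bits(84.38)

print(f"per-character perplexity: {char_ppl:.3f}")
print(f"per-word cross-entropy: {word_nats:.3f} nats, {word_bits:.3f} bits")
```

The character-level and word-level numbers are measured per different units (characters versus words), so they are not directly comparable even after conversion.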

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/r-transformer-recurrent-neural-network/music-modeling-on-nottingham)](https://paperswithcode.com/sota/music-modeling-on-nottingham?p=r-transformer-recurrent-neural-network)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/r-transformer-recurrent-neural-network/sequential-image-classification-on-sequential)](https://paperswithcode.com/sota/sequential-image-classification-on-sequential?p=r-transformer-recurrent-neural-network)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/r-transformer-recurrent-neural-network/language-modelling-on-penn-treebank-character)](https://paperswithcode.com/sota/language-modelling-on-penn-treebank-character?p=r-transformer-recurrent-neural-network)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/r-transformer-recurrent-neural-network/language-modelling-on-penn-treebank-word)](https://paperswithcode.com/sota/language-modelling-on-penn-treebank-word?p=r-transformer-recurrent-neural-network)`

R-Transformer: Recurrent Neural Network Enhanced Transformer

ICLR 2020 · Zhiwei Wang, Yao Ma, Zitao Liu, Jiliang Tang

Recurrent Neural Networks (RNNs) have long been the dominant choice for sequence modeling. However, they suffer from two serious issues: they struggle to capture very long-term dependencies, and their sequential computation cannot be parallelized. Therefore, many non-recurrent sequence models built on convolution and attention operations have been proposed recently. Notably, models with multi-head attention such as the Transformer have proven extremely effective at capturing long-term dependencies in a variety of sequence modeling tasks. Despite their success, however, these models lack the components needed to model local structures in sequences and rely heavily on position embeddings, which have limited effect and require considerable design effort. In this paper, we propose the R-Transformer, which enjoys the advantages of both RNNs and the multi-head attention mechanism while avoiding their respective drawbacks. The proposed model can effectively capture both local structures and global long-term dependencies in sequences without any use of position embeddings. We evaluate R-Transformer through extensive experiments with data from a wide range of domains, and the empirical results show that R-Transformer outperforms the state-of-the-art methods by a large margin in most of the tasks. We have made the code publicly available at https://github.com/DSE-MSU/R-transformer.
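
As a rough illustration of the idea in the abstract (local structure from a recurrent component, global dependencies from attention, no position embeddings), here is a minimal pure-Python sketch. It is not the authors' implementation, which is available in the linked repository. It assumes the recurrent component is run over short local windows ending at each position; the function names, scalar features, fixed weights, window size, single attention head, and causal masking are all illustrative assumptions, and residual connections, normalization, and feed-forward layers are omitted.

```python
import math

def local_windows(seq, window, pad=0.0):
    """For each position t, return the `window` most recent items ending at t.

    Early positions are left-padded with `pad` (an assumption of this sketch).
    """
    padded = [pad] * (window - 1) + list(seq)
    return [padded[t:t + window] for t in range(len(seq))]

def local_rnn(seq, window, w_in=0.5, w_rec=0.5):
    """Run a tiny scalar tanh RNN over each local window; keep the last state.

    Because the RNN reads each window in order, its output depends on the
    local ordering of the inputs, with no position embedding involved.
    """
    states = []
    for win in local_windows(seq, window):
        h = 0.0
        for x in win:
            h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def causal_self_attention(states):
    """Single-head dot-product attention; position t attends to positions <= t."""
    out = []
    for t, q in enumerate(states):
        visible = states[:t + 1]
        scores = [q * k for k in visible]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w / total * v for w, v in zip(weights, visible)))
    return out

def layer_sketch(seq, window=3):
    """Local recurrent pass followed by global attention (hypothetical layer)."""
    return causal_self_attention(local_rnn(seq, window))

print(layer_sketch([1.0, 0.0, 2.0, -1.0]))
```

Swapping two neighbouring inputs changes the output of `local_rnn`, which is the sense in which a recurrent pass over local windows can supply ordering information that plain self-attention lacks.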

Code

DSE-MSU/R-transformer official

223

sfox14/butterfly-r-transformer

Tasks

Language Modelling

Music Modeling

Position

Sequential Image Classification

Datasets

MNIST

Penn Treebank

Nottingham

Results from the Paper

Ranked #1 on Music Modeling on Nottingham

Methods

Absolute Position Encodings • Adam • BPE • Convolution • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • ReLU • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
