Although the Transformer has achieved great success on many NLP tasks, its heavy structure with fully-connected attention leads to a dependence on large amounts of training data. In this paper, we present Star-Transformer, a lightweight alternative obtained by careful sparsification. To reduce model complexity, we replace the fully-connected structure with a star-shaped topology, in which every two non-adjacent nodes are connected through a shared relay node. Thus, complexity is reduced from quadratic to linear, while preserving the capacity to capture both local composition and long-range dependency. Experiments on four tasks (22 datasets) show that Star-Transformer achieves significant improvements over the standard Transformer on modestly sized datasets.
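
The quadratic-to-linear claim can be illustrated by counting connections. The following is a minimal sketch, not the authors' implementation: it assumes that "adjacent" means immediate neighbors in the sequence with no wraparound, and the function names (`star_edges`, `full_edges`, `reachable_within_two_hops`) are hypothetical helpers introduced only for this illustration.

```python
def star_edges(n):
    """Undirected edges for n sequence nodes plus one relay node (index n).

    Assumption: adjacent nodes are immediate sequence neighbors, no wraparound.
    """
    edges = set()
    for i in range(n):
        edges.add((i, n))          # radial edge: node i <-> shared relay
        if i + 1 < n:
            edges.add((i, i + 1))  # local edge: node i <-> next node
    return edges


def full_edges(n):
    """Undirected edges of a fully-connected graph over n sequence nodes."""
    return {(i, j) for i in range(n) for j in range(i + 1, n)}


def reachable_within_two_hops(n):
    """Check that every pair of sequence nodes is linked directly or via the relay."""
    edges = star_edges(n)
    for i in range(n):
        for j in range(i + 1, n):
            direct = (i, j) in edges
            via_relay = (i, n) in edges and (j, n) in edges
            if not (direct or via_relay):
                return False
    return True


if __name__ == "__main__":
    for n in (8, 100, 1000):
        # Star topology grows as 2n - 1; full connectivity grows as n(n - 1) / 2.
        print(n, len(star_edges(n)), len(full_edges(n)))
```

Under these assumptions the star topology has 2n - 1 connections while the fully-connected graph has n(n - 1)/2, and any two non-adjacent nodes remain two hops apart through the relay.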

NAACL 2019
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| Natural Language Inference | SNLI | Star-Transformer (no cross sentence attention) | % Test Accuracy | 86.0 | # 62 |
| Sentiment Analysis | SST-5 Fine-grained classification | Star-Transformer | Accuracy | 53.0 | # 12 |