We evaluated various compositional models, from bag-of-words representations
to compositional RNN-based models, on several extrinsic supervised and
unsupervised evaluation benchmarks. Our results confirm that weighted vector
averaging can outperform context-sensitive models on most benchmarks, but
structural features encoded in RNN models can also be useful in certain
tasks. We analyzed some of the evaluation datasets to identify
the aspects of meaning they measure and the characteristics of the various
models that explain their performance variance.