We analyze the performance of different sentiment classification models on
syntactically complex inputs like A-but-B sentences. The first contribution of
this analysis addresses reproducible research: to meaningfully compare
different models, their accuracies must be averaged over far more random seeds
than has traditionally been reported. With proper averaging in place, we
notice that the distillation model described in arXiv:1603.06318v4 [cs.LG],
which incorporates explicit logic rules for sentiment classification, is
ineffective. In contrast, using contextualized ELMo embeddings
(arXiv:1802.05365v2 [cs.CL]) instead of logic rules yields significantly better
performance. Additionally, we provide analysis and visualizations that
demonstrate ELMo's ability to implicitly learn logic rules. Finally, a
crowdsourced analysis reveals how ELMo outperforms baseline models even on
sentences with ambiguous sentiment labels.
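
As an illustration of the seed-averaging point above, the following is a minimal sketch of reporting a model's accuracy as a mean with its standard error over many random seeds. It is not the evaluation code used in this work. The function `simulated_accuracy`, its `base` and `noise` parameters, and the seed counts are hypothetical stand-ins for a real train-and-evaluate run.

```python
import random
import statistics


def simulated_accuracy(seed: int, base: float = 0.87, noise: float = 0.005) -> float:
    """Hypothetical stand-in for one training run; returns a noisy accuracy.

    In practice this would train and evaluate a classifier with the given seed.
    The values of `base` and `noise` are made up for illustration only.
    """
    rng = random.Random(seed)
    return base + rng.gauss(0.0, noise)


def summarize(accuracies):
    """Return (mean, standard error of the mean) for a list of accuracies."""
    mean = statistics.mean(accuracies)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(accuracies) / len(accuracies) ** 0.5
    return mean, sem


def average_over_seeds(run, seeds):
    """Call `run` once per seed and summarize the resulting accuracies."""
    return summarize([run(seed) for seed in seeds])


if __name__ == "__main__":
    # Compare a small seed budget with a much larger one.
    for n_seeds in (3, 100):
        mean, sem = average_over_seeds(simulated_accuracy, range(n_seeds))
        print(f"{n_seeds:>3} seeds: accuracy = {mean:.4f} +/- {sem:.4f}")
```

When the gap between two models is comparable to the seed-to-seed variation, a comparison based on a handful of seeds can point in either direction. Reporting the standard error alongside the mean makes that uncertainty visible.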