We create a new NLI test set that shows the deficiency of state-of-the-art
models in inferences that require lexical and world knowledge. The new examples
are simpler than those in the SNLI test set, containing sentences that differ by
at most one word from sentences in the training set. Yet, the performance on the new
test set is substantially worse across systems trained on SNLI, demonstrating
that these systems are limited in their generalization ability, failing to
capture many simple inferences.