Not-so fine-tuning: Measures of Common Sense for Language Models

29 Sep 2021 · Darren Abramson, Ali Emami

Language models built using semi-supervised machine learning on large corpora of natural language have rapidly come to dominate the fields of natural language generation and understanding. In this paper, we examine some critical assessments concerning the development and subsequent evaluation of language models and offer an alternative account. We provide evidence for the following conclusion: a language model with relatively few parameters, trained for relatively few steps, can perform robustly across language tasks in a manner that demonstrates compositionality, at the cost of GPU time spent on evaluation. The zero-shot measurement technique we advocate applies pseudo-log likelihoods from masked language models to measure the relative probability of substitution alternatives in forced-choice language tasks such as the Winograd Schema Challenge, Winogrande, and CommonsenseQA, as well as a minimal adversarial test set we create and dub Winogradversarial. In some cases, our results are state-of-the-art (SOTA) in an absolute sense, outperforming any published result in the literature. In others, our results are SOTA relative to published methods similar or identical to our own, in some cases by wide margins, while remaining below absolute SOTA. We provide a narrative, consistent with our measurement approach, that has advantages over problematic prevailing approaches to evaluating and applying language models for common sense.
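
To make the scoring technique concrete, the sketch below shows one common way to compute pseudo-log likelihoods with a masked language model and use them for zero-shot forced choice between substitution alternatives. It is a minimal illustration, not the authors' exact setup: the model name, the helper functions, and the Winograd-style example sentences are assumptions for demonstration purposes.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumption: any HuggingFace masked LM can stand in here; the paper's exact
# model and configuration may differ.
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum log P(token | rest of sentence), masking one token at a time."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    total = 0.0
    with torch.no_grad():
        # Skip the special [CLS] (position 0) and [SEP] (last position) tokens.
        for i in range(1, input_ids.size(0) - 1):
            masked = input_ids.clone()
            true_id = masked[i].item()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            total += log_probs[true_id].item()
    return total

def choose(alternatives: list[str]) -> str:
    """Zero-shot forced choice: pick the substitution with the higher PLL."""
    return max(alternatives, key=pseudo_log_likelihood)

# Hypothetical Winograd-style item: the ambiguous pronoun is replaced by each
# candidate referent, and the higher-scoring completion is selected.
candidates = [
    "The trophy doesn't fit in the suitcase because the trophy is too big.",
    "The trophy doesn't fit in the suitcase because the suitcase is too big.",
]
print(choose(candidates))

Because every token of every alternative must be masked and re-scored in turn, this procedure trades fine-tuning cost for evaluation-time GPU cost, which is the trade-off the abstract refers to.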
