Towards Machine Ethics with Language Models

ICLR 2021 · Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt

We show how to assess a language model’s knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense moral judgments. Models predict widespread moral judgments about diverse written scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may later serve as a general regularizer of behavior in open-ended settings. We find that language models have low but nontrivial performance. The ETHICS dataset enables meaningful progress on value learning today, providing a stepping stone toward AI that is aligned with human values.
