Aligning AI With Shared Human Values

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgments. Our work shows that progress can be made on machine ethics today, and it provides a stepping stone toward AI that is aligned with human values.

Results from the Paper

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Average | hendrycks2020ethics | ALBERT-xxlarge | Accuracy (Test) | 0.71 | #1 |
| Average | hendrycks2020ethics | RoBERTa-large | Accuracy (Test) | 0.68 | #2 |
| Average | hendrycks2020ethics | BERT-large | Accuracy (Test) | 0.561 | #3 |
| Average | hendrycks2020ethics | BERT-base | Accuracy (Test) | 0.516 | #4 |
| Average | hendrycks2020ethics | GPT-3 (few-shot) | Accuracy (Test) | 0.368 | #5 |
| Average | hendrycks2020ethics | Random Baseline | Accuracy (Test) | 0.242 | #6 |

