Do-Not-Answer is an open-source dataset for evaluating safeguards in large language models and for developing safer open-source LLMs at low cost. The dataset is curated and filtered to consist only of prompts that responsible language models should not answer. We annotate and assess the responses of six popular LLMs to these instructions.
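As a minimal sketch of how one might load and inspect the prompts, assuming the dataset is published on the Hugging Face Hub (the `LibrAI/do-not-answer` id, split name, and record layout below are assumptions, not confirmed by this page):

```python
# Minimal sketch: load the Do-Not-Answer prompts and peek at a few.
# Assumes the dataset is hosted on the Hugging Face Hub as
# "LibrAI/do-not-answer"; the id and split name are assumptions.
from datasets import load_dataset

dataset = load_dataset("LibrAI/do-not-answer", split="train")

# Each record holds an instruction that a responsible model should refuse.
for example in dataset.select(range(3)):
    print(example)
```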