2 code implementations • 9 Apr 2024 • Hongyu Cai, Arjun Arunasalam, Leo Y. Lin, Antonio Bianchi, Z. Berkay Celik
We evaluate our metrics on a benchmark dataset produced from three malicious-intent datasets and three jailbreak systems.
Informativeness • Language Modelling