no code implementations • 14 Dec 2023 • Tony T. Wang, Miles Wang, Kaivalya Hariharan, Nir Shavit
LLMs often face competing pressures (for example, helpfulness vs. harmlessness).
1 code implementation • 18 Nov 2022 • Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell
Some previous works have proposed human-interpretable adversarial attacks, including copy/paste attacks, in which one natural image pasted into another causes an unexpected misclassification.