1 code implementation • 20 Feb 2024 • Benjamin Plaut, Khanh Nguyen, Tu Trinh
Although large language models (LLMs) perform impressively on many tasks, overconfidence remains a problem.
1 code implementation • 15 Feb 2024 • Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer
We show that our new grading scheme better accords with human judgment of response quality and overall jailbreak effectiveness, especially on the sort of low-quality responses that contribute the most to over-estimation of jailbreak performance on existing benchmarks.
no code implementations • 28 Nov 2022 • Tu Trinh, Haoyu Chen, Daniel S. Brown
We evaluate our approach in simulation for both discrete and continuous state-space domains and illustrate the feasibility of developing a robotic system that can accurately evaluate demonstration sufficiency.