Search Results for author: Tu Trinh

Found 3 papers, 2 papers with code

Softmax Probabilities (Mostly) Predict Large Language Model Correctness on Multiple-Choice Q&A

1 code implementation • 20 Feb 2024 • Benjamin Plaut, Khanh Nguyen, Tu Trinh

Although large language models (LLMs) perform impressively on many tasks, overconfidence remains a problem.

Language Modelling • Large Language Model • +1

A StrongREJECT for Empty Jailbreaks

1 code implementation • 15 Feb 2024 • Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer

We show that our new grading scheme better accords with human judgment of response quality and overall jailbreak effectiveness, especially on the sort of low-quality responses that contribute the most to over-estimation of jailbreak performance on existing benchmarks.

Autonomous Assessment of Demonstration Sufficiency via Bayesian Inverse Reinforcement Learning

no code implementations • 28 Nov 2022 • Tu Trinh, Haoyu Chen, Daniel S. Brown

We evaluate our approach in simulation for both discrete and continuous state-space domains and illustrate the feasibility of developing a robotic system that can accurately evaluate demonstration sufficiency.

Active Learning • reinforcement-learning • +1
