Search Results for author: Dillon Bowen

Found 3 papers, 1 paper with code

A StrongREJECT for Empty Jailbreaks

1 code implementation • 15 Feb 2024 • Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer

We show that our new grading scheme better accords with human judgment of response quality and overall jailbreak effectiveness, especially on the sort of low-quality responses that contribute the most to over-estimation of jailbreak performance on existing benchmarks.

Simple models predict behavior at least as well as behavioral scientists

no code implementations • 1 Aug 2022 • Dillon Bowen

We compared the behavioral scientists' predictions to random chance, linear models, and simple heuristics like "behavioral interventions have no effect" and "all published psychology research is false."

Generalized SHAP: Generating multiple types of explanations in machine learning

no code implementations • 12 Jun 2020 • Dillon Bowen, Lyle Ungar

Many important questions about a model cannot be answered just by explaining how much each feature contributes to its output.

BIG-bench Machine Learning, General Classification
