1 code implementation • 9 Apr 2025 • Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan
We present the first scalable approach to solving assistance games and apply it to a new, challenging Minecraft-based assistance game with over $10^{400}$ possible goals.
2 code implementations • 15 Feb 2024 • Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer
To create a benchmark, researchers must choose a dataset of forbidden prompts to which a victim model will respond, along with an evaluation method that scores the harmfulness of the victim model's responses.
no code implementations • 2 Nov 2023 • Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell
Our benchmark results show that many models are vulnerable to the attack strategies in the Tensor Trust dataset.
no code implementations • 23 Oct 2023 • Rachel Freedman, Justin Svegliato, Kyle Wray, Stuart Russell
The HUB framework and ATS algorithm demonstrate the importance of leveraging differences between teachers to learn accurate reward models, facilitating future research on active teacher selection for robust reward modeling.
no code implementations • 2 Mar 2023 • Peter Barnett, Rachel Freedman, Justin Svegliato, Stuart Russell
Reward learning algorithms utilize human feedback to infer a reward function, which is then used to train an AI system.
no code implementations • 13 Jan 2023 • Samer B. Nashed, Justin Svegliato, Su Lin Blodgett
As automated decision making and decision assistance systems become common in everyday life, research on the prevention or mitigation of potential harms that arise from decisions made by these systems has proliferated.
1 code implementation • 1 Aug 2021 • Shane Parr, Ishan Khatri, Justin Svegliato, Shlomo Zilberstein
Autonomous systems often operate in environments where the behavior of multiple agents is coordinated by a shared global state.
no code implementations • 23 Jul 2020 • Connor Basich, Justin Svegliato, Kyle Hollins Wray, Stefan J. Witwicki, Shlomo Zilberstein
Given the complexity of real-world, unstructured domains, it is often impossible or impractical to design models that include every feature needed to handle all possible scenarios that an autonomous system may encounter.
no code implementations • 17 Mar 2020 • Connor Basich, Justin Svegliato, Kyle Hollins Wray, Stefan Witwicki, Joydeep Biswas, Shlomo Zilberstein
Interest in semi-autonomous systems (SAS) is growing rapidly as a paradigm for deploying autonomous systems in domains that require occasional reliance on humans.