1 code implementation • 15 Feb 2024 • Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer
We show that our new grading scheme better accords with human judgment of response quality and overall jailbreak effectiveness, especially on the sort of low-quality responses that contribute the most to over-estimation of jailbreak performance on existing benchmarks.
no code implementations • 2 Nov 2023 • Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell
Our benchmark results show that many models are vulnerable to the attack strategies in the Tensor Trust dataset.
2 code implementations • 22 Nov 2022 • Adam Gleave, Mohammad Taufeeque, Juan Rocamonde, Erik Jenner, Steven H. Wang, Sam Toyer, Maximilian Ernestus, Nora Belrose, Scott Emmons, Stuart Russell
imitation provides open-source implementations of imitation and reward learning algorithms in PyTorch.
2 code implementations • 16 May 2022 • Xin Chen, Sam Toyer, Cody Wild, Scott Emmons, Ian Fischer, Kuang-Huei Lee, Neel Alex, Steven H. Wang, Ping Luo, Stuart Russell, Pieter Abbeel, Rohin Shah
We propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation across several environment suites.
no code implementations • 22 Mar 2022 • Adam Gleave, Sam Toyer
Inverse Reinforcement Learning (IRL) algorithms infer a reward function that explains demonstrations provided by an expert acting in the environment.
2 code implementations • 2 Dec 2020 • Pedro Freire, Adam Gleave, Sam Toyer, Stuart Russell
We evaluate a range of common reward and imitation learning algorithms on our tasks.
1 code implementation • NeurIPS 2020 • Sam Toyer, Rohin Shah, Andrew Critch, Stuart Russell
This rewards precise reproduction of demonstrations in one particular environment, but provides little information about how robustly an algorithm can generalise the demonstrator's intent to substantially different deployment settings.
1 code implementation • 4 Aug 2019 • Sam Toyer, Felipe Trevizan, Sylvie Thiébaux, Lexing Xie
In this paper, we discuss the learning of generalised policies for probabilistic and classical planning problems using Action Schema Networks (ASNets).
5 code implementations • ICLR 2019 • Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, Sergey Levine
By enforcing a constraint on the mutual information between the observations and the discriminator's internal representation, we can effectively modulate the discriminator's accuracy and maintain useful and informative gradients.
1 code implementation • 13 Sep 2017 • Sam Toyer, Felipe Trevizan, Sylvie Thiébaux, Lexing Xie
In this paper, we introduce the Action Schema Network (ASNet): a neural network architecture for learning generalised policies for probabilistic planning problems.
no code implementations • 24 Jul 2017 • Sam Toyer, Anoop Cherian, Tengda Han, Stephen Gould
Human pose forecasting is an important problem in computer vision with applications to human-robot interaction, visual surveillance, and autonomous driving.