no code implementations • 15 Apr 2024 • Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs).
2 code implementations • 24 Nov 2023 • Javier Rando, Florian Tramèr
Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses.
no code implementations • 6 Nov 2023 • Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando
Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour.
no code implementations • 27 Oct 2023 • Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, He He
This allows the model to separate truth from falsehood and to control the truthfulness of its generations.
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
1 code implementation • 2 Jun 2023 • Javier Rando, Fernando Perez-Cruz, Briland Hitaj
Large language models (LLMs) successfully model natural language from vast amounts of text without the need for explicit supervision.
no code implementations • 3 Oct 2022 • Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, Florian Tramèr
We then reverse-engineer the filter and find that while it aims to prevent sexual content, it ignores violence, gore, and other similarly disturbing content.
1 code implementation • 14 Jun 2022 • Javier Rando, Nasib Naimi, Thomas Baumann, Max Mathys
This work presents the first analysis of the adversarial robustness of self-supervised Vision Transformers trained using DINO.
1 code implementation • 10 Apr 2022 • Edoardo Mosca, Shreyash Agarwal, Javier Rando, Georg Groh
Adversarial attacks pose a major challenge for current machine learning research.
1 code implementation • 23 Jan 2020 • Valerio Lorini, Javier Rando, Diego Saez-Trumper, Carlos Castillo
We also note how coverage of floods in countries with the lowest income, as well as countries in South America, is substantially lower than the coverage of floods in middle-income countries.