1 code implementation • 12 Jun 2024 • Edoardo Debenedetti, Javier Rando, Daniel Paleka, Silaghi Fineas Florin, Dragos Albastroiu, Niv Cohen, Yuval Lemberg, Reshmi Ghosh, Rui Wen, Ahmed Salem, Giovanni Cherubin, Santiago Zanella-Beguelin, Robin Schmid, Victor Klemm, Takahiro Miki, Chenhao Li, Stefan Kraft, Mario Fritz, Florian Tramèr, Sahar Abdelnabi, Lea Schönherr
To study this problem, we organized a capture-the-flag competition at IEEE SaTML 2024, where the flag is a secret string in the LLM system prompt.
2 code implementations • 22 Apr 2024 • Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, Florian Tramèr
Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities.
1 code implementation • 15 Apr 2024 • Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Aleksandar Petrov, Christian Schroeder de Witt, Sumeet Ramesh Motwan, Yoshua Bengio, Danqi Chen, Philip H. S. Torr, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs).
no code implementations • 3 Apr 2024 • Eyal Aharoni, Sharlene Fernandes, Daniel J. Brady, Caelan Alexander, Michael Criner, Kara Queen, Javier Rando, Eddy Nahmias, Victor Crespo
Advances in artificial intelligence (AI) raise important questions about whether people view moral evaluations by AI systems similarly to human-generated moral evaluations.
2 code implementations • 24 Nov 2023 • Javier Rando, Florian Tramèr
Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses.
no code implementations • 6 Nov 2023 • Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando
Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour.
no code implementations • 27 Oct 2023 • Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, He He
Modeling a truthful persona allows the model to separate truth from falsehood and control the truthfulness of its generation.
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
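The reward-modeling step at the core of RLHF is commonly trained with a pairwise Bradley-Terry preference loss; a minimal sketch, with the function name and toy scores purely illustrative:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood commonly used for RLHF reward
    modeling: the human-preferred response should get the higher reward."""
    # -log sigmoid(r_c - r_r) = log(1 + exp(-(r_c - r_r))), averaged over pairs
    losses = [math.log1p(math.exp(-(rc - rr)))
              for rc, rr in zip(r_chosen, r_rejected)]
    return sum(losses) / len(losses)

# Toy reward scores for three preference pairs (illustrative numbers)
chosen, rejected = [2.0, 1.5, 0.3], [0.5, 1.0, 0.8]
loss = preference_loss(chosen, rejected)
# Widening the margin between chosen and rejected lowers the loss
assert preference_loss([rc + 1.0 for rc in chosen], rejected) < loss
```

The loss goes to zero as the reward gap between preferred and rejected responses grows, which is what makes a trained reward model usable as an RL objective.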
1 code implementation • 2 Jun 2023 • Javier Rando, Fernando Perez-Cruz, Briland Hitaj
Large language models (LLMs) successfully model natural language from vast amounts of text without the need for explicit supervision.
no code implementations • 3 Oct 2022 • Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, Florian Tramèr
We reverse-engineer the Stable Diffusion safety filter and find that while it aims to prevent sexual content, it ignores violence, gore, and other similarly disturbing content.
1 code implementation • 14 Jun 2022 • Javier Rando, Nasib Naimi, Thomas Baumann, Max Mathys
This work conducts the first analysis of the adversarial robustness of self-supervised Vision Transformers trained using DINO.
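Adversarial robustness analyses of this kind typically perturb inputs along the sign of the loss gradient (FGSM). A minimal sketch on a toy logistic-regression model, standing in for the Vision Transformer; all names and numbers are illustrative:

```python
def fgsm_perturb(x, w, y, eps):
    """Fast Gradient Sign Method on a toy linear model with logistic loss
    log(1 + exp(-y * w.x)), y in {+1, -1}. For this loss the gradient
    w.r.t. x_i has sign -y * sign(w_i), so no autodiff is needed here."""
    grad_sign = [-y * (1 if wi > 0 else -1 if wi < 0 else 0) for wi in w]
    # Step each input coordinate by eps in the loss-increasing direction
    return [xi + eps * g for xi, g in zip(x, grad_sign)]

w = [1.0, -2.0]          # toy weights
x, y = [0.5, -0.25], 1   # correctly classified point: w.x = 1.0 > 0
x_adv = fgsm_perturb(x, w, y, eps=0.6)
# The perturbation pushes the score toward misclassification
score = sum(wi * xi for wi, xi in zip(w, x_adv))
assert score < sum(wi * xi for wi, xi in zip(w, x))
```

For a real ViT the gradient sign would come from backpropagation rather than a closed form, but the attack structure is the same: a single eps-bounded step per coordinate.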
1 code implementation • 10 Apr 2022 • Edoardo Mosca, Shreyash Agarwal, Javier Rando, Georg Groh
Adversarial attacks are a major challenge for current machine learning research.
1 code implementation • 23 Jan 2020 • Valerio Lorini, Javier Rando, Diego Saez-Trumper, Carlos Castillo
We also note how coverage of floods in countries with the lowest income, as well as countries in South America, is substantially lower than the coverage of floods in middle-income countries.