no code implementations • 15 Apr 2024 • Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs).
2 code implementations • 24 Nov 2023 • Javier Rando, Florian Tramèr
Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses.
no code implementations • 6 Nov 2023 • Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando
Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour.
no code implementations • 27 Oct 2023 • Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, He He
This allows the model to separate truth from falsehood and to control the truthfulness of its generations.
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
1 code implementation • 2 Jun 2023 • Javier Rando, Fernando Perez-Cruz, Briland Hitaj
Large language models (LLMs) successfully model natural language from vast amounts of text without the need for explicit supervision.
no code implementations • 3 Oct 2022 • Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, Florian Tramèr
We then reverse-engineer the filter and find that while it aims to prevent sexual content, it ignores violence, gore, and other similarly disturbing content.
1 code implementation • 14 Jun 2022 • Javier Rando, Nasib Naimi, Thomas Baumann, Max Mathys
This work presents the first analysis of the adversarial robustness of self-supervised Vision Transformers trained using DINO.
1 code implementation • 10 Apr 2022 • Edoardo Mosca, Shreyash Agarwal, Javier Rando, Georg Groh
Adversarial attacks pose a major challenge for current machine learning research.
1 code implementation • 23 Jan 2020 • Valerio Lorini, Javier Rando, Diego Saez-Trumper, Carlos Castillo
We also note how coverage of floods in countries with the lowest income, as well as countries in South America, is substantially lower than the coverage of floods in middle-income countries.