Search Results for author: Mrinank Sharma

Found 17 papers, 7 papers with code

Best-of-N Jailbreaking

1 code implementation • 4 Dec 2024 • John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma

We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts.
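The core loop described above is simple: repeatedly apply random augmentations to a prompt and keep sampling until one variant succeeds. A minimal sketch of that sampling loop, using random capitalization and adjacent-character swaps as illustrative stand-ins for the paper's augmentations (the `is_jailbroken` judge is a hypothetical callback, not part of the paper's code):

```python
import random

def augment(prompt: str, rng: random.Random) -> str:
    """Apply simple character-level augmentations: random capitalization
    and occasional neighbor swaps (illustrative stand-ins only)."""
    chars = [c.upper() if rng.random() < 0.5 else c.lower() for c in prompt]
    for i in range(len(chars) - 1):
        if rng.random() < 0.05:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def best_of_n(prompt: str, is_jailbroken, n: int = 10_000, seed: int = 0):
    """Sample up to n augmented prompts; return the first that the
    (caller-supplied) judge flags as successful, else None."""
    rng = random.Random(seed)
    for _ in range(n):
        candidate = augment(prompt, rng)
        if is_jailbroken(candidate):
            return candidate
    return None
```

The key design point is that the attack is entirely black-box: it only needs samples from the model and a success judge, with no gradient access.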

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

no code implementations • 3 Dec 2024 • Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez

Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem.

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

no code implementations • 26 Nov 2024 • Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, Henry Sleight, Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan

As large language models (LLMs) become increasingly capable, it is prudent to assess whether safety measures remain effective even if LLMs intentionally try to bypass them.

Code Generation

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

no code implementations • 12 Nov 2024 • Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma

We propose an alternative approach: instead of seeking perfect adversarial robustness, we develop rapid response techniques that aim to block whole classes of jailbreaks after observing only a handful of attacks.

Adversarial Robustness

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

1 code implementation • 11 Oct 2024 • Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, Fazl Barez

Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks.

Data Poisoning • Language Modeling +2

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

no code implementations • 21 Jul 2024 • Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez

These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.

Instruction Following • Language Modelling +1

Towards Understanding Sycophancy in Language Models

1 code implementation • 20 Oct 2023 • Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.

Text Generation

Understanding and Controlling a Maze-Solving Policy Network

no code implementations • 12 Oct 2023 • Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, Alexander Matt Turner

To understand the goals and goal representations of AI systems, we carefully study a pretrained reinforcement learning policy that solves mazes by navigating to a range of target squares.

Incorporating Unlabelled Data into Bayesian Neural Networks

no code implementations • 4 Apr 2023 • Mrinank Sharma, Tom Rainforth, Yee Whye Teh, Vincent Fortuin

Conventional Bayesian Neural Networks (BNNs) are unable to leverage unlabelled data to improve their predictions.

Active Learning • Self-Supervised Learning +1

Do Bayesian Neural Networks Need To Be Fully Stochastic?

2 code implementations • 11 Nov 2022 • Mrinank Sharma, Sebastian Farquhar, Eric Nalisnick, Tom Rainforth

We investigate the benefit of treating all the parameters in a Bayesian neural network stochastically and find compelling theoretical and empirical evidence that this standard construction may be unnecessary.
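The alternative to full stochasticity is a partially stochastic network: only a subset of parameters (for instance, the final layer) is treated as random. A minimal numpy sketch of that idea, assuming a mean-field Gaussian posterior over the head weights only (all names, shapes, and values here are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic feature extractor (fixed point-estimate weights)...
W_feat = rng.normal(size=(4, 8))
# ...with a stochastic linear head: mean-field Gaussian over its weights.
w_mean = rng.normal(size=8)
w_std = 0.1 * np.ones(8)

def predict(x, n_samples=100):
    """Monte Carlo prediction: sample only the head weights and average.

    Returns the predictive mean and a spread estimate per input."""
    h = np.tanh(x @ W_feat)                           # deterministic features
    ws = rng.normal(w_mean, w_std, size=(n_samples, 8))
    logits = h @ ws.T                                 # (batch, n_samples)
    return logits.mean(axis=1), logits.std(axis=1)

x = rng.normal(size=(3, 4))
mean, std = predict(x)
```

The appeal of this construction is practical: inference only has to marginalize over the small stochastic subset, while the deterministic body is trained as usual.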
