Search Results for author: Fazl Barez

Found 35 papers, 18 papers with code

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

no code implementations30 May 2025 Narmeen Oozeer, Luke Marks, Fazl Barez, Amirali Abdullah

Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in activation space and require per-attribute tuning.

Attribute

SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors

no code implementations20 May 2025 Maheep Chaudhary, Fazl Barez

We propose a real-time framework to predict harmful AI outputs before they occur by using an unsupervised approach that treats normal behavior as the baseline and harmful outputs as outliers.
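
A minimal sketch of the outlier framing described above, assuming activations are already extracted as fixed-size vectors; the placeholder data and the IsolationForest detector are illustrative, not the paper's actual features or model.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_acts = rng.normal(0.0, 1.0, size=(1000, 64))   # placeholder "normal" activation vectors
incoming_acts = rng.normal(0.0, 1.0, size=(10, 64))   # activations to screen at inference time

detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(normal_acts)                             # treat normal behavior as the baseline

flags = detector.predict(incoming_acts)               # -1 marks an outlier, i.e. a candidate harmful output
print(flags)
```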

Scaling sparse feature circuit finding for in-context learning

no code implementations18 Apr 2025 Dmitrii Kharlapenko, Stepan Shabalin, Fazl Barez, Arthur Conmy, Neel Nanda

In this work, we demonstrate their effectiveness by using SAEs to deepen our understanding of the mechanism behind in-context learning (ICL).

In-Context Learning Large Language Model

Do Sparse Autoencoders Generalize? A Case Study of Answerability

no code implementations27 Feb 2025 Lovis Heindrich, Philip Torr, Fazl Barez, Veronika Thost

We extensively evaluate SAE feature generalization across diverse answerability datasets for Gemma 2 SAEs.

Language Modeling Language Modelling

AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons

no code implementations19 Feb 2025 Shaona Ghosh, Heather Frase, Adina Williams, Sarah Luger, Paul Röttger, Fazl Barez, Sean McGregor, Kenneth Fricklas, Mala Kumar, Quentin Feuillade--Montixi, Kurt Bollacker, Felix Friedrich, Ryan Tsang, Bertie Vidgen, Alicia Parrish, Chris Knotz, Eleonora Presani, Jonathan Bennion, Marisa Ferrara Boston, Mike Kuniavsky, Wiebke Hutiri, James Ezick, Malek Ben Salem, Rajat Sahay, Sujata Goswami, Usman Gohar, Ben Huang, Supheakmungkol Sarin, Elie Alhajjar, Canyu Chen, Roman Eng, Kashyap Ramanandula Manjusha, Virendra Mehta, Eileen Long, Murali Emani, Natan Vidra, Benjamin Rukundo, Abolfazl Shahbazi, Kongtao Chen, Rajat Ghosh, Vithursan Thangarasa, Pierre Peigné, Abhinav Singh, Max Bartolo, Satyapriya Krishna, Mubashara Akhtar, Rafael Gold, Cody Coleman, Luis Oala, Vassil Tashev, Joseph Marvin Imperial, Amy Russ, Sasidhar Kunapuli, Nicolas Miailhe, Julien Delaunay, Bhaktipriya Radharapu, Rajat Shinde, Tuesday, Debojyoti Dutta, Declan Grabb, Ananya Gangavarapu, Saurav Sahay, Agasthya Gangavarapu, Patrick Schramowski, Stephen Singam, Tom David, Xudong Han, Priyanka Mary Mammen, Tarunima Prabhakar, Venelin Kovatchev, Rebecca Weiss, Ahmed Ahmed, Kelvin N. Manyeki, Sandeep Madireddy, Foutse khomh, Fedor Zhdanov, Joachim Baumann, Nina Vasan, Xianjun Yang, Carlos Mougn, Jibin Rajan Varghese, Hussain Chinoy, Seshakrishna Jitendar, Manil Maskey, Claire V. Hardgrove, TianHao Li, Aakash Gupta, Emil Joswin, Yifan Mai, Shachi H Kumar, Cigdem Patlak, Kevin Lu, Vincent Alessi, Sree Bhargavi Balija, Chenhe Gu, Robert Sullivan, James Gealy, Matt Lavrisa, James Goel, Peter Mattson, Percy Liang, Joaquin Vanschoren

This work represents a crucial step toward establishing global standards for AI risk and reliability evaluation while acknowledging the need for continued development in areas such as multiturn interactions, multimodal understanding, coverage of additional languages, and emerging hazard categories.

Rethinking AI Cultural Alignment

no code implementations13 Jan 2025 Michal Bravansky, Filip Trhlik, Fazl Barez

As general-purpose artificial intelligence (AI) systems become increasingly integrated with diverse human communities, cultural alignment has emerged as a crucial element in their deployment.

Multiple-choice

Open Problems in Machine Unlearning for AI Safety

no code implementations9 Jan 2025 Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O'Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, Yarin Gal

As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount.

Machine Unlearning

Best-of-N Jailbreaking

1 code implementation4 Dec 2024 John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma

We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts.
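
A minimal sketch of the best-of-N sampling loop described above; `query_model` and `is_harmful` are hypothetical placeholders, and the paper's actual augmentations and judge differ.

```python
import random

def augment(prompt: str) -> str:
    """Apply simple character-level noise: random capitalization and an adjacent swap."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < 0.1:
            chars[i] = chars[i].swapcase()
    if len(chars) > 1 and random.random() < 0.5:
        j = random.randrange(len(chars) - 1)
        chars[j], chars[j + 1] = chars[j + 1], chars[j]
    return "".join(chars)

def best_of_n(prompt, query_model, is_harmful, n=10_000):
    """Resample augmented prompts until the judge flags a response, or give up."""
    for _ in range(n):
        candidate = augment(prompt)
        response = query_model(candidate)
        if is_harmful(response):
            return candidate, response
    return None, None
```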

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

no code implementations3 Dec 2024 Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez

Defending large language models against jailbreaks so that they never engage in a broadly defined set of forbidden behaviors is an open problem.

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders

1 code implementation2 Nov 2024 Luke Marks, Alasdair Paren, David Krueger, Fazl Barez

By training on synthetic data with known features of the input, we show that MFR can help SAEs learn those features, as we can directly compare the features learned by the SAE with the input features for the synthetic data.

EEG
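
A minimal sketch of the synthetic-data check described above, assuming an SAE is already trained; only the comparison between learned decoder directions and the known ground-truth features is shown, not the MFR objective itself.

```python
import numpy as np

def max_cosine_match(learned: np.ndarray, true: np.ndarray) -> np.ndarray:
    """For each ground-truth feature, the best cosine similarity to any learned feature."""
    learned = learned / np.linalg.norm(learned, axis=1, keepdims=True)
    true = true / np.linalg.norm(true, axis=1, keepdims=True)
    return (true @ learned.T).max(axis=1)        # (n_true, n_learned) similarities, row-wise max

rng = np.random.default_rng(0)
true_features = rng.normal(size=(16, 64))        # known features used to build the synthetic inputs
learned_features = rng.normal(size=(128, 64))    # placeholder for the trained SAE's decoder directions
print(max_cosine_match(learned_features, true_features))
```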

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

1 code implementation11 Oct 2024 Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, Fazl Barez

Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks.

Data Poisoning Language Modeling +2

Towards Interpreting Visual Information Processing in Vision-Language Models

1 code implementation9 Oct 2024 Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, Fazl Barez

Our approach focuses on analyzing the localization of object information, the evolution of visual token representations across layers, and the mechanism of integrating visual information for predictions.

Language Modeling Language Modelling +1

Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders

1 code implementation9 Oct 2024 Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David Krueger, Fazl Barez

Providing evidence for this hypothesis would enable researchers to exploit universal properties, facilitating the generalization of mechanistic interpretability techniques across models.

Dictionary Learning

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

1 code implementation14 Jun 2024 Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger

We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments.

Language Modelling Large Language Model

Visualizing Neural Network Imagination

no code implementations10 May 2024 Nevan Wichers, Victor Tao, Riccardo Volpato, Fazl Barez

Our goal is to visualize what environment states the networks are representing.

Decoder

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

no code implementations23 Feb 2024 Clement Neo, Shay B. Cohen, Fazl Barez

Understanding the inner workings of large language models (LLMs) is crucial for advancing their theoretical foundations and real-world applications.

Text Generation

Increasing Trust in Language Models through the Reuse of Verified Circuits

2 code implementations4 Feb 2024 Philip Quirke, Clement Neo, Fazl Barez

To exhibit the reusability of verified modules, we insert the trained integer addition model into a larger untrained model and train the combined model to perform both addition and subtraction.

Large Language Models Relearn Removed Concepts

1 code implementation3 Jan 2024 Michelle Lo, Shay B. Cohen, Fazl Barez

This demonstrates that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons.

Model Editing

Measuring Value Alignment

no code implementations23 Dec 2023 Fazl Barez, Philip Torr

As artificial intelligence (AI) systems become increasingly integrated into various domains, ensuring that they align with human values becomes critical.

Autonomous Vehicles Recommendation Systems

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models

1 code implementation7 Nov 2023 Michael Lan, Philip Torr, Fazl Barez

We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of Arabic numerals, number words, and months.

Language Modelling Large Language Model +1

Understanding Addition in Transformers

4 code implementations19 Oct 2023 Philip Quirke, Fazl Barez

Understanding the inner workings of machine learning models like Transformers is vital for their safe and ethical use.

Interpreting Learned Feedback Patterns in Large Language Models

1 code implementation12 Oct 2023 Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, David Krueger, Philip Torr, Fazl Barez

Our probes are trained on a condensed, sparse and interpretable representation of LLM activations, making it easier to correlate features of the input with our probe's predictions.
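
A minimal sketch of probing on a sparse feature representation, using placeholder data and a logistic-regression probe; the paper's actual feedback-pattern targets and feature pipeline are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder sparse feature activations (most entries zero), standing in for the
# condensed representation of LLM activations described above.
sparse_feats = rng.exponential(1.0, size=(500, 256)) * (rng.random((500, 256)) < 0.05)
labels = (sparse_feats[:, 0] > 0).astype(int)    # placeholder target tied to a single feature

probe = LogisticRegression(max_iter=1000).fit(sparse_feats, labels)
print(probe.score(sparse_feats, labels))
print(np.argsort(np.abs(probe.coef_[0]))[-5:])   # the sparse features the probe leans on most
```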

AI Systems of Concern

no code implementations9 Oct 2023 Kayla Matteucci, Shahar Avin, Fazl Barez, Seán Ó hÉigeartaigh

Concerns around future dangers from advanced AI often centre on systems hypothesised to have intrinsic characteristics such as agent-like behaviour, strategic awareness, and long-range planning.

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models

1 code implementation3 Oct 2023 Albert Garde, Esben Kran, Fazl Barez

By granting access to state-of-the-art interpretability methods, DeepDecipher makes LLMs more transparent, trustworthy, and safe.

Neuron to Graph: Interpreting Language Model Neurons at Scale

1 code implementation31 May 2023 Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez

Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to.

Language Modeling Language Modelling

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark

1 code implementation27 May 2023 Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas, Fazl Barez

We use this improved benchmark to evaluate recent model editing techniques and find that they suffer from low specificity.

Model Editing Specificity

The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python

1 code implementation24 May 2023 Antonio Valerio Miceli-Barone, Fazl Barez, Ioannis Konstas, Shay B. Cohen

Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming.

Code Generation

System III: Learning with Domain Knowledge for Safety Constraints

no code implementations23 Apr 2023 Fazl Barez, Hosein Hasanbeig, Alessandro Abate

We evaluate the satisfaction of these constraints via p-norms in state vector space.

Safe Exploration
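
An illustrative sketch of scoring constraint satisfaction with a p-norm over the state vector, as mentioned above; the reference safe state and the exact constraint formulation here are assumptions, not the paper's formulation.

```python
import numpy as np

def constraint_violation(state: np.ndarray, safe_state: np.ndarray, p: float = 2.0) -> float:
    """p-norm distance in state vector space; larger values mean a stronger violation."""
    return float(np.linalg.norm(state - safe_state, ord=p))

state = np.array([0.9, -0.2, 1.4])
safe_state = np.array([0.0, 0.0, 1.0])
print(constraint_violation(state, safe_state, p=1.0))      # L1 distance
print(constraint_violation(state, safe_state, p=np.inf))   # worst-case single coordinate
```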

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

no code implementations22 Apr 2023 Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez

Understanding the function of individual neurons within language models is essential for mechanistic interpretability research.

Fairness in AI and Its Long-Term Implications on Society

no code implementations16 Apr 2023 Ondrej Bohdal, Timothy Hospedales, Philip H. S. Torr, Fazl Barez

Successful deployment of artificial intelligence (AI) in various settings has led to numerous positive outcomes for individuals and society.

Decision Making Fairness

Exploring the Advantages of Transformers for High-Frequency Trading

1 code implementation20 Feb 2023 Fazl Barez, Paul Bilokon, Arthur Gervais, Nikita Lisitsyn

This paper explores novel deep learning Transformer architectures for high-frequency Bitcoin-USDT log-return forecasting and compares them to traditional Long Short-Term Memory models.

Decoder Position +3
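
A minimal sketch of the log-return target described above, computed from placeholder BTC-USDT prices; the Transformer and LSTM models themselves are omitted.

```python
import numpy as np

prices = np.array([27013.5, 27020.1, 27011.8, 27025.0])   # placeholder BTC-USDT mid-prices
log_returns = np.diff(np.log(prices))                      # r_t = log(p_t) - log(p_{t-1})
print(log_returns)
```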

PMIC: Improving Multi-Agent Reinforcement Learning with Progressive Mutual Information Collaboration

1 code implementation16 Mar 2022 Pengyi Li, Hongyao Tang, Tianpei Yang, Xiaotian Hao, Tong Sang, Yan Zheng, Jianye Hao, Matthew E. Taylor, Wenyuan Tao, Zhen Wang, Fazl Barez

However, we reveal that sub-optimal collaborative behaviors also emerge with strong correlations, and that simply maximizing the MI can, surprisingly, hinder learning towards better collaboration.

Multi-agent Reinforcement Learning reinforcement-learning +1
