no code implementations • 30 May 2025 • Narmeen Oozeer, Luke Marks, Fazl Barez, Amirali Abdullah
Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in activation space and require per-attribute tuning.
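The linear steering baseline the paper contrasts against typically works by adding fixed directions to hidden states. A minimal sketch of that baseline, assuming difference-of-means steering vectors; names and shapes are illustrative, not the paper's API:

```python
import numpy as np

def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction for one attribute.

    pos_acts / neg_acts: (n_examples, d_model) hidden states collected on
    prompts that do / do not exhibit the attribute.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden: np.ndarray, vectors: list[np.ndarray], alphas: list[float]) -> np.ndarray:
    """Additively apply several per-attribute vectors at inference time.
    This additivity assumption is exactly what causes interference when
    attributes are not linearly separable in activation space."""
    for v, a in zip(vectors, alphas):
        hidden = hidden + a * v
    return hidden
```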
no code implementations • 20 May 2025 • Maheep Chaudhary, Fazl Barez
We propose a real-time framework to predict harmful AI outputs before they occur by using an unsupervised approach that treats normal behavior as the baseline and harmful outputs as outliers.
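A minimal sketch of the unsupervised framing described above: fit a model of "normal" activations only, then flag deviations at inference time. The Mahalanobis-distance detector here is an assumption chosen for illustration, not necessarily the paper's method:

```python
import numpy as np

class ActivationOutlierDetector:
    def fit(self, normal_acts: np.ndarray) -> "ActivationOutlierDetector":
        # normal_acts: (n, d) activations collected from benign model behavior
        self.mu = normal_acts.mean(axis=0)
        cov = np.cov(normal_acts, rowvar=False)
        self.prec = np.linalg.pinv(cov)  # precision matrix
        return self

    def score(self, acts: np.ndarray) -> np.ndarray:
        # Larger Mahalanobis distance => more anomalous => likelier harmful
        diff = acts - self.mu
        return np.einsum("nd,de,ne->n", diff, self.prec, diff)

    def flag(self, acts: np.ndarray, threshold: float) -> np.ndarray:
        return self.score(acts) > threshold
```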
no code implementations • 18 Apr 2025 • Dmitrii Kharlapenko, Stepan Shabalin, Fazl Barez, Arthur Conmy, Neel Nanda
In this work, we demonstrate the effectiveness of SAEs by using them to deepen our understanding of the mechanism behind in-context learning (ICL).
no code implementations • 27 Feb 2025 • Lovis Heindrich, Philip Torr, Fazl Barez, Veronika Thost
We extensively evaluate SAE feature generalization across diverse answerability datasets for Gemma 2 SAEs.
no code implementations • 19 Feb 2025 • Shaona Ghosh, Heather Frase, Adina Williams, Sarah Luger, Paul Röttger, Fazl Barez, Sean McGregor, Kenneth Fricklas, Mala Kumar, Quentin Feuillade-Montixi, Kurt Bollacker, Felix Friedrich, Ryan Tsang, Bertie Vidgen, Alicia Parrish, Chris Knotz, Eleonora Presani, Jonathan Bennion, Marisa Ferrara Boston, Mike Kuniavsky, Wiebke Hutiri, James Ezick, Malek Ben Salem, Rajat Sahay, Sujata Goswami, Usman Gohar, Ben Huang, Supheakmungkol Sarin, Elie Alhajjar, Canyu Chen, Roman Eng, Kashyap Ramanandula Manjusha, Virendra Mehta, Eileen Long, Murali Emani, Natan Vidra, Benjamin Rukundo, Abolfazl Shahbazi, Kongtao Chen, Rajat Ghosh, Vithursan Thangarasa, Pierre Peigné, Abhinav Singh, Max Bartolo, Satyapriya Krishna, Mubashara Akhtar, Rafael Gold, Cody Coleman, Luis Oala, Vassil Tashev, Joseph Marvin Imperial, Amy Russ, Sasidhar Kunapuli, Nicolas Miailhe, Julien Delaunay, Bhaktipriya Radharapu, Rajat Shinde, Tuesday, Debojyoti Dutta, Declan Grabb, Ananya Gangavarapu, Saurav Sahay, Agasthya Gangavarapu, Patrick Schramowski, Stephen Singam, Tom David, Xudong Han, Priyanka Mary Mammen, Tarunima Prabhakar, Venelin Kovatchev, Rebecca Weiss, Ahmed Ahmed, Kelvin N. Manyeki, Sandeep Madireddy, Foutse Khomh, Fedor Zhdanov, Joachim Baumann, Nina Vasan, Xianjun Yang, Carlos Mougan, Jibin Rajan Varghese, Hussain Chinoy, Seshakrishna Jitendar, Manil Maskey, Claire V. Hardgrove, TianHao Li, Aakash Gupta, Emil Joswin, Yifan Mai, Shachi H Kumar, Cigdem Patlak, Kevin Lu, Vincent Alessi, Sree Bhargavi Balija, Chenhe Gu, Robert Sullivan, James Gealy, Matt Lavrisa, James Goel, Peter Mattson, Percy Liang, Joaquin Vanschoren
This work represents a crucial step toward establishing global standards for AI risk and reliability evaluation while acknowledging the need for continued development in areas such as multiturn interactions, multimodal understanding, coverage of additional languages, and emerging hazard categories.
no code implementations • 13 Jan 2025 • Michal Bravansky, Filip Trhlik, Fazl Barez
As general-purpose artificial intelligence (AI) systems become increasingly integrated with diverse human communities, cultural alignment has emerged as a crucial element in their deployment.
no code implementations • 9 Jan 2025 • Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O'Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, Yarin Gal
As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount.
1 code implementation • 4 Dec 2024 • John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma
We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts.
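A sketch of the Best-of-N loop described above: keep sampling random augmentations of a prompt until one elicits a forbidden response. The `query_model` and `is_harmful` callables are hypothetical stand-ins for the target model API and a response classifier, and the augmentations shown are merely in the spirit of the paper's character-level perturbations:

```python
import random

def augment(prompt: str, rng: random.Random) -> str:
    """Apply simple character-level augmentations: random capitalization
    flips and an occasional adjacent-character swap."""
    chars = list(prompt)
    for i in range(len(chars)):
        if rng.random() < 0.05:
            chars[i] = chars[i].upper() if chars[i].islower() else chars[i].lower()
    if len(chars) > 1 and rng.random() < 0.5:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def bon_jailbreak(prompt: str, n: int, query_model, is_harmful, seed: int = 0):
    rng = random.Random(seed)
    for _ in range(n):  # e.g. n = 10,000, matching the ASR numbers above
        candidate = augment(prompt, rng)
        response = query_model(candidate)
        if is_harmful(response):
            return candidate, response  # attack succeeded
    return None  # no success within the sampling budget
```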
no code implementations • 3 Dec 2024 • Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez
Defending large language models against jailbreaks so that they never engage in a broadly defined set of forbidden behaviors is an open problem.
1 code implementation • 2 Nov 2024 • Luke Marks, Alasdair Paren, David Krueger, Fazl Barez
By training on synthetic data with known features of the input, we show that MFR can help SAEs learn those features, as we can directly compare the features learned by the SAE with the input features for the synthetic data.
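A sketch of that synthetic-data evaluation: inputs are sparse linear combinations of known ground-truth feature directions, so learned SAE decoder rows can be matched against them directly. Hyperparameters are illustrative, and this plain reconstruction-plus-L1 objective is not the paper's MFR training code:

```python
import torch

torch.manual_seed(0)
d, k = 64, 256                      # activation dim, number of true features
true_feats = torch.nn.functional.normalize(torch.randn(k, d), dim=1)

def sample_batch(n: int) -> torch.Tensor:
    coeffs = torch.rand(n, k) * (torch.rand(n, k) < 0.02)  # sparse coefficients
    return coeffs @ true_feats                             # (n, d) inputs

enc = torch.nn.Linear(d, k)
dec = torch.nn.Linear(k, d, bias=False)
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

for step in range(2000):
    x = sample_batch(512)
    z = torch.relu(enc(x))          # sparse code
    x_hat = dec(z)
    loss = (x_hat - x).pow(2).mean() + 1e-3 * z.abs().mean()  # recon + L1
    opt.zero_grad(); loss.backward(); opt.step()

# Match each true feature to its best-aligned decoder row by cosine similarity.
W = torch.nn.functional.normalize(dec.weight.T, dim=1)  # (k, d) decoder rows
sims = true_feats @ W.T                                 # (k, k) cosine matrix
print("mean best-match cosine:", sims.max(dim=1).values.mean().item())
```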
1 code implementation • 11 Oct 2024 • Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, Fazl Barez
Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks.
1 code implementation • 9 Oct 2024 • Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, Fazl Barez
Our approach focuses on analyzing the localization of object information, the evolution of visual token representations across layers, and the mechanism of integrating visual information for predictions.
1 code implementation • 9 Oct 2024 • Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David Krueger, Fazl Barez
Providing evidence for this hypothesis would enable researchers to exploit universal properties, facilitating the generalization of mechanistic interpretability techniques across models.
1 code implementation • 14 Jun 2024 • Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger
We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments.
no code implementations • 14 May 2024 • Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Aaron Purewal, Csaba Botos, Fabro Steibel, Fazel Keshtkar, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Imperial, Juan Arturo Nolazco, Lori Landay, Matthew Jackson, Philip H. S. Torr, Trevor Darrell, Yong Lee, Jakob Foerster
Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education.
no code implementations • 10 May 2024 • Nevan Wichers, Victor Tao, Riccardo Volpato, Fazl Barez
Our goal is to visualize what environment states the networks are representing.
no code implementations • 25 Apr 2024 • Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder de Witt, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Botos Csaba, Fabro Steibel, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Marvin Imperial, Juan A. Nolazco-Flores, Lori Landay, Matthew Jackson, Paul Röttger, Philip H. S. Torr, Trevor Darrell, Yong Suk Lee, Jakob Foerster
In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education.
no code implementations • 23 Feb 2024 • Clement Neo, Shay B. Cohen, Fazl Barez
Understanding the inner workings of large language models (LLMs) is crucial for advancing their theoretical foundations and real-world applications.
2 code implementations • 4 Feb 2024 • Philip Quirke, Clement Neo, Fazl Barez
To exhibit the reusability of verified modules, we insert the trained integer addition model into a larger untrained model and train the combined model to perform both addition and subtraction.
1 code implementation • 10 Jan 2024 • Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it).
1 code implementation • 3 Jan 2024 • Michelle Lo, Shay B. Cohen, Fazl Barez
This demonstrates that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons.
no code implementations • 23 Dec 2023 • Fazl Barez, Philip Torr
As artificial intelligence (AI) systems become increasingly integrated into various domains, ensuring that they align with human values becomes critical.
1 code implementation • 7 Nov 2023 • Michael Lan, Philip Torr, Fazl Barez
We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of Arabic numerals, number words, and months.
4 code implementations • 19 Oct 2023 • Philip Quirke, Fazl Barez
Understanding the inner workings of machine learning models like Transformers is vital for their safe and ethical use.
1 code implementation • 12 Oct 2023 • Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, David Krueger, Philip Torr, Fazl Barez
Our probes are trained on a condensed, sparse and interpretable representation of LLM activations, making it easier to correlate features of the input with our probe's predictions.
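A minimal sketch of that probing setup: a linear probe is trained on a sparse, interpretable representation of activations (e.g. SAE feature activations) rather than raw hidden states, so probe weights map back to named features. The arrays, file names, and downstream label here are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: per-example SAE feature activations and a binary
# property of the corresponding inputs we want to predict.
sae_feats = np.load("sae_feature_acts.npy")   # (n_examples, n_features)
labels = np.load("labels.npy")                # (n_examples,)

# L1-regularized probe keeps the weight vector sparse and readable.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
probe.fit(sae_feats, labels)

# Because each input dimension is one SAE feature, large-magnitude probe
# weights directly identify which interpretable features drive predictions.
top = np.argsort(-np.abs(probe.coef_[0]))[:10]
print("most predictive SAE features:", top)
```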
no code implementations • 9 Oct 2023 • Kayla Matteucci, Shahar Avin, Fazl Barez, Seán Ó hÉigeartaigh
Concerns around future dangers from advanced AI often centre on systems hypothesised to have intrinsic characteristics such as agent-like behaviour, strategic awareness, and long-range planning.
1 code implementation • 3 Oct 2023 • Albert Garde, Esben Kran, Fazl Barez
By granting access to state-of-the-art interpretability methods, DeepDecipher makes LLMs more transparent, trustworthy, and safe.
1 code implementation • 31 May 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez
Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to.
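A sketch of that conventional workflow, which the paper automates: scan a corpus, record the tokens on which a chosen neuron fires most strongly, and inspect the surrounding snippets by hand. The `neuron_activations` hook is a hypothetical stand-in returning per-token activations for one neuron:

```python
import heapq

def top_activating_examples(corpus, neuron_activations, k: int = 20):
    """Return the k (activation, snippet) pairs with strongest activation."""
    heap = []  # min-heap holding the best k found so far
    for text in corpus:
        acts, tokens = neuron_activations(text)  # per-token activations
        for i, a in enumerate(acts):
            snippet = " ".join(tokens[max(0, i - 5): i + 6])  # local context
            item = (a, snippet)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif a > heap[0][0]:
                heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)
```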
1 code implementation • 27 May 2023 • Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas, Fazl Barez
We use this improved benchmark to evaluate recent model editing techniques and find that they suffer from low specificity.
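A sketch of a specificity check in the spirit of the benchmark above: after editing one fact, similar-but-unrelated facts should be left unchanged. The `edit_model` and `query` helpers are hypothetical stand-ins for a model-editing method and a fact-completion call:

```python
def specificity(model, edit, neighborhood_facts, edit_model, query) -> float:
    """Fraction of unrelated facts whose answers survive the edit."""
    before = {f: query(model, f) for f in neighborhood_facts}
    edited = edit_model(model, edit)  # apply one targeted fact edit
    unchanged = sum(query(edited, f) == before[f] for f in neighborhood_facts)
    return unchanged / len(neighborhood_facts)
```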
1 code implementation • 24 May 2023 • Antonio Valerio Miceli-Barone, Fazl Barez, Ioannis Konstas, Shay B. Cohen
Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming.
no code implementations • 23 Apr 2023 • Fazl Barez, Hosein Hasanbeig, Alessandro Abate
We evaluate the satisfaction of these constraints via p-norms in state vector space.
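A minimal sketch of that constraint check, assuming a distance-based formulation: measure the p-norm distance of the current state vector from a constraint's reference point and map it to a satisfaction score. The exponential soft-satisfaction form is an illustrative choice, not necessarily the paper's:

```python
import numpy as np

def constraint_satisfaction(state: np.ndarray, ref: np.ndarray,
                            p: float = 2.0, scale: float = 1.0) -> float:
    """Maps p-norm distance ||state - ref||_p to a (0, 1] score:
    1 when the state meets the constraint exactly, decaying with distance."""
    dist = np.linalg.norm(state - ref, ord=p)
    return float(np.exp(-dist / scale))

# e.g. score how close an agent's state is to a reference region, under L1:
s, ref = np.array([0.2, 0.9]), np.array([0.0, 1.0])
print(constraint_satisfaction(s, ref, p=1.0))
```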
no code implementations • 22 Apr 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez
Understanding the function of individual neurons within language models is essential for mechanistic interpretability research.
no code implementations • 16 Apr 2023 • Ondrej Bohdal, Timothy Hospedales, Philip H. S. Torr, Fazl Barez
Successful deployment of artificial intelligence (AI) in various settings has led to numerous positive outcomes for individuals and society.
1 code implementation • 20 Feb 2023 • Fazl Barez, Paul Bilokon, Arthur Gervais, Nikita Lisitsyn
This paper explores novel deep learning Transformer architectures for high-frequency Bitcoin-USDT log-return forecasting and compares them to traditional Long Short-Term Memory models.
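A sketch of the forecasting target described above: log returns of the BTC-USDT price series, windowed into supervised (past, next) pairs that either a Transformer or an LSTM could be trained on. The lookback length is illustrative:

```python
import numpy as np

def log_returns(prices: np.ndarray) -> np.ndarray:
    """r_t = ln(p_t / p_{t-1}); the standard target for return forecasting."""
    return np.diff(np.log(prices))

def windows(returns: np.ndarray, lookback: int = 64):
    """(X, y) pairs: `lookback` past returns predict the next return."""
    X = np.stack([returns[i: i + lookback]
                  for i in range(len(returns) - lookback)])
    y = returns[lookback:]
    return X, y
```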
1 code implementation • 16 Mar 2022 • Pengyi Li, Hongyao Tang, Tianpei Yang, Xiaotian Hao, Tong Sang, Yan Zheng, Jianye Hao, Matthew E. Taylor, Wenyuan Tao, Zhen Wang, Fazl Barez
However, we reveal sub-optimal collaborative behaviors also emerge with strong correlations, and simply maximizing the MI can, surprisingly, hinder the learning towards better collaboration.
Multi-agent Reinforcement Learning
reinforcement-learning