no code implementations • 24 Sep 2024 • Satvik Golechha, Dylan Cope, Nandi Schoots
One approach to improving network interpretability is clusterability, i.e., splitting a model into disjoint clusters that can be studied independently.
1 code implementation • 9 Mar 2024 • Teun van der Weij, Massimo Poesio, Nandi Schoots
In this paper, we investigate the efficacy of activation steering for broad skills and multiple behaviours.
1 code implementation • 2 Mar 2024 • Nicholas Pochinkov, Nandi Schoots
This approach is a compute- and data-efficient method for identifying and removing neurons that enable specific behaviours.
no code implementations • 6 Dec 2023 • Ole Jorgensen, Dylan Cope, Nandi Schoots, Murray Shanahan
Recent work in activation steering has demonstrated the potential to better control the outputs of Large Language Models (LLMs), but it involves finding steering vectors.
1 code implementation • 1 Nov 2023 • Hugo Fry, Seamus Fallows, Ian Fan, Jamie Wright, Nandi Schoots
We investigate the optimization target of Contrast-Consistent Search (CCS), which aims to recover the internal representations of truth of a large language model.
no code implementations • 20 Jun 2023 • Mattia Jacopo Villani, Nandi Schoots
We constructively prove that every deep ReLU network can be rewritten as a functionally identical three-layer network with weights valued in the extended reals.
no code implementations • 20 May 2023 • Nandi Schoots, Dylan Cope
We study the relationship between the entropy of intermediate representations and a model's robustness to distributional shift.
no code implementations • 30 Aug 2021 • Adam X. Yang, Maxime Robeyns, Edward Milsom, Ben Anson, Nandi Schoots, Laurence Aitchison
In particular, we show that deep Gaussian processes (DGPs) in the Bayesian representation learning limit have exactly multivariate Gaussian posteriors, and that the posterior covariances can be obtained by optimizing an interpretable objective combining a log-likelihood, which improves performance, with a series of KL-divergences, which keep the posteriors close to the prior.
1 code implementation • 19 Apr 2021 • Dylan Cope, Nandi Schoots
We introduce two methods for improving the performance of agents meeting for the first time to accomplish a communicative task.