1 code implementation • 24 Apr 2024 • Jacob Pfau, William Merrill, Samuel R. Bowman
We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula.
1 code implementation • 20 Oct 2023 • Henning Bartsch, Ole Jorgensen, Domenic Rosati, Jason Hoelscher-Obermaier, Jacob Pfau
Using this test, we find that despite increases in self-consistency, models usually place significant weight on alternative, inconsistent answers.
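The quantity in question can be illustrated with a minimal sketch. Assuming (hypothetically — the paper's actual test may differ) that a model's answer distribution is estimated by repeated sampling, the "weight on alternative, inconsistent answers" is simply the probability mass outside the modal answer:

```python
from collections import Counter

# Hypothetical sampled answers from a model for one question,
# e.g. from several temperature-sampled completions.
samples = ["B", "B", "A", "B", "C", "B", "A", "B"]

counts = Counter(samples)
total = sum(counts.values())
top_answer, top_count = counts.most_common(1)[0]

# Probability mass the model places on answers inconsistent
# with its most frequent (modal) answer.
alt_weight = 1 - top_count / total
print(top_answer, alt_weight)  # "B", 0.375
```

A model can thus look self-consistent by its modal answer while still spreading substantial probability over alternatives.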
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
4 code implementations • 28 May 2021 • Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, David Krueger
We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL).
1 code implementation • 6 Apr 2021 • Jacob Pfau, Albert T. Young, Jerome Wei, Maria L. Wei, Michael J. Keiser
Our proposed method, Robust Concept Activation Vectors (RCAV), quantifies the effects of semantic concepts on individual model predictions and on model behavior as a whole.
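A minimal sketch of the underlying concept-activation-vector idea (not the paper's RCAV method itself): fit a linear probe separating concept-labeled activations from random activations, take its normalized weight vector as the concept direction, and measure a model output's directional derivative along it. All data and the toy scorer below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden-layer width of the hypothetical network

# Synthetic "activations": concept examples are shifted along dimension 0.
concept_acts = rng.normal(size=(200, d))
concept_acts[:, 0] += 2.0
random_acts = rng.normal(size=(200, d))

# Logistic-regression probe (plain gradient ascent) separating the two sets;
# its weight vector, normalized, is the concept activation vector (CAV).
X = np.vstack([concept_acts, random_acts])
y = np.concatenate([np.ones(200), np.zeros(200)])
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w += 0.1 * X.T @ (y - p) / len(y)
cav = w / np.linalg.norm(w)

# Concept sensitivity: directional derivative of a model output along the CAV.
# Here the "model output" is a toy linear score, so its gradient is constant.
score_weights = rng.normal(size=d)
sensitivity = float(score_weights @ cav)
```

Averaging (or, in the robust variant, statistically testing) such sensitivities over many inputs is what turns a per-example derivative into a claim about model behavior as a whole.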
no code implementations • 16 Oct 2019 • Jacob Pfau, Albert T. Young, Maria L. Wei, Michael J. Keiser
In high-stakes applications of machine learning models, interpretability methods are expected to provide guarantees that models are right for the right reasons.