Search Results for author: Alex D'Amour

Found 4 papers, 1 paper with code

Transforming and Combining Rewards for Aligning Large Language Models

no code implementations · 1 Feb 2024 · ZiHao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, Victor Veitch

A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model.

Language Modelling
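
For context, a minimal sketch of the first stage the abstract describes: fitting a reward model on preference data, here with the common Bradley-Terry pairwise loss. The `reward_model` callable and its encoded-response arguments are illustrative assumptions, and the reward transformations and combinations that the paper itself proposes are not shown.

```python
import torch.nn.functional as F

# Hedged sketch of standard preference-based reward modeling. `reward_model`
# maps an encoded (prompt, response) pair to a scalar score; the Bradley-Terry
# pairwise loss below is a common choice, not necessarily the paper's.
def preference_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)      # scalar score for the preferred response
    r_rejected = reward_model(rejected_ids)  # scalar score for the rejected response
    # Maximize the log-probability that the preferred response outscores the other.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```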

Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking

no code implementations · 14 Dec 2023 · Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant

However, even pretrain reward ensembles (ensembles whose members differ in their pretraining seeds) do not eliminate reward hacking: we show several qualitative reward hacking phenomena that are not mitigated by ensembling, because all reward models in the ensemble exhibit similar error patterns.

Language Modelling
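
As a point of reference, a hedged sketch of the reward-ensembling setup the abstract studies: score a candidate response with several independently trained reward models and aggregate conservatively. The `min` aggregator shown is one common conservative choice, and the `reward_models` interface is an assumption, not the paper's code.

```python
import torch

# Score a response under an ensemble of reward models and aggregate
# conservatively. Per the abstract, even this fails when all ensemble
# members share the same error patterns.
def ensemble_reward(reward_models, prompt, response):
    scores = torch.stack([rm(prompt, response) for rm in reward_models])
    # Taking the minimum penalizes any response that even one member distrusts.
    return scores.min(dim=0).values
```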

Detecting Underspecification with Local Ensembles

2 code implementations · ICLR 2020 · David Madras, James Atwood, Alex D'Amour

We present local ensembles, a method for detecting underspecification -- when many possible predictors are consistent with the training data and model class -- at test time in a pre-trained model.

Active Learning · Out-of-Distribution Detection
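
A simplified sketch of the intuition behind local ensembles: nudge the pre-trained model's weights and measure how much its test-time prediction moves. The isotropic random perturbations below are a stand-in for illustration; the actual method picks perturbation directions of low loss curvature, which is what makes the ensemble "local".

```python
import torch

# Hedged sketch: high prediction variance across slightly perturbed copies
# of a pre-trained model signals underspecification, i.e. many nearby
# predictors fit the training data but disagree on this input.
def prediction_spread(model, x, n_perturbations=10, scale=1e-3):
    params = list(model.parameters())
    preds = []
    with torch.no_grad():
        for _ in range(n_perturbations):
            noise = [scale * torch.randn_like(p) for p in params]
            for p, n in zip(params, noise):
                p.add_(n)          # perturb the weights in place
            preds.append(model(x))
            for p, n in zip(params, noise):
                p.sub_(n)          # restore the original weights
    return torch.stack(preds).var(dim=0)
```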

BriarPatches: Pixel-Space Interventions for Inducing Demographic Parity

no code implementations · 17 Dec 2018 · Alexey A. Gritsenko, Alex D'Amour, James Atwood, Yoni Halpern, D. Sculley

We introduce the BriarPatch, a pixel-space intervention that obscures sensitive attributes from representations encoded in pre-trained classifiers.
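
Mechanically, a pixel-space intervention of this kind amounts to overlaying a small (learned) patch on each image before it reaches the frozen, pre-trained classifier. The sketch below shows only that overlay step; the objective that trains the patch (pushing the classifier's sensitive-attribute prediction toward chance) is omitted, and all names and shapes are illustrative assumptions rather than the paper's procedure.

```python
import torch

# Overlay a learned patch on a batch of images in [0, 1]. `patch` is a small
# (h, w) or (channels, h, w) tensor; `corner` gives its top-left placement.
def apply_patch(images, patch, corner=(0, 0)):
    patched = images.clone()
    h, w = patch.shape[-2:]
    y, x = corner
    patched[..., y:y + h, x:x + w] = patch  # overwrite a small image region
    return patched.clamp(0.0, 1.0)          # keep pixel values valid
```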
