Search Results for author: Alexander Pan

Found 6 papers, 4 papers with code

Feedback Loops With Language Models Drive In-Context Reward Hacking

1 code implementation 9 Feb 2024 Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt

Language models influence the external world: they query APIs that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents.

Representation Engineering: A Top-Down Approach to AI Transparency

1 code implementation 2 Oct 2023 Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience.

Question Answering

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

1 code implementation ICLR 2022 Alexander Pan, Kush Bhatia, Jacob Steinhardt

Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.

Anomaly Detection

Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training

no code implementations 18 Oct 2021 Alexander Pan, Yongkyun Lee, Huan Zhang, Yize Chen, Yuanyuan Shi

Due to the proliferation of renewable energy and its intrinsic intermittency and stochasticity, current power systems face severe operational challenges.

Decision Making, reinforcement-learning
