Search Results for author: Alexander Pan

Found 9 papers, 7 papers with code

LatentQA: Teaching LLMs to Decode Activations Into Natural Language

no code implementations • 11 Dec 2024 Alexander Pan, Lijie Chen, Jacob Steinhardt

Interpretability methods seek to understand language model representations, yet the outputs of most such methods -- circuits, vectors, scalars -- are not immediately human-interpretable.

Decoder • Language Modeling +1

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

1 code implementation • 31 Jul 2024 Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks

As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks.

General Knowledge

Feedback Loops With Language Models Drive In-Context Reward Hacking

1 code implementation • 9 Feb 2024 Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt

Language models influence the external world: they query APIs that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents.

Representation Engineering: A Top-Down Approach to AI Transparency

5 code implementations • 2 Oct 2023 Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience.

Question Answering

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

1 code implementation • ICLR 2022 Alexander Pan, Kush Bhatia, Jacob Steinhardt

Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.

Anomaly Detection
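The reward-hacking phenomenon the abstract describes can be illustrated with a small sketch (a hypothetical toy example, not code or an environment from the paper): an optimizer that only sees a misspecified proxy reward picks a policy that the true objective would reject.

```python
# Toy illustration of reward misspecification (hypothetical example):
# the proxy reward tracks only forward velocity, while the intended
# objective also penalizes energy use.

def proxy_reward(velocity, energy):
    # Misspecified: ignores the energy cost entirely.
    return velocity

def true_reward(velocity, energy):
    # Intended objective: fast but efficient.
    return velocity - 2.0 * energy

# Candidate policies summarized by their (velocity, energy) outcomes.
policies = {
    "cautious": (1.0, 0.2),
    "balanced": (2.0, 0.6),
    "reckless": (3.0, 2.5),  # exploits the proxy's blind spot
}

# Optimizing the proxy selects the "reckless" policy...
best_proxy = max(policies, key=lambda p: proxy_reward(*policies[p]))
# ...even though it scores worst under the true objective.
best_true = max(policies, key=lambda p: true_reward(*policies[p]))

print(best_proxy)  # reckless
print(best_true)   # balanced
```

The gap between `best_proxy` and `best_true` is the "exploited gap in a misspecified reward function" that the paper studies systematically at scale.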

Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training

no code implementations • 18 Oct 2021 Alexander Pan, Yongkyun Lee, Huan Zhang, Yize Chen, Yuanyuan Shi

Due to the proliferation of renewable energy and its intrinsic intermittency and stochasticity, current power systems face severe operational challenges.

Decision Making • reinforcement-learning +1
