Search Results for author: Rylan Schaeffer

Found 30 papers, 2 papers with code

Position: Model Collapse Does Not Mean What You Think

no code implementations • 5 Mar 2025 • Rylan Schaeffer, Joshua Kazdan, Alvan Caleb Arulandu, Sanmi Koyejo

To assess how significantly different interpretations of model collapse threaten future generative models, we posit what we believe are realistic conditions for studying model collapse and then conduct a rigorous assessment of the literature's methodologies through this lens.

Position

No, of course I can! Refusal Mechanisms Can Be Exploited Using Harmless Fine-Tuning Data

no code implementations • 26 Feb 2025 • Joshua Kazdan, Lisa Yu, Rylan Schaeffer, Chris Cundy, Sanmi Koyejo, Krishnamurthy Dvijotham

Against open-source models protected by simple defenses, we improve ASRs by an average of 3.25 times compared to the best-performing previous attacks that use only harmless data.

Data Poisoning

Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks

no code implementations • 24 Feb 2025 • Rylan Schaeffer, Punit Singh Koura, Binh Tang, Ranjan Subramanian, Aaditya K Singh, Todor Mihaylov, Prajjwal Bhargava, Lovish Madaan, Niladri S. Chatterji, Vedanuj Goswami, Sergey Edunov, Dieuwke Hupkes, Sanmi Koyejo, Sharan Narang

The explosion of high-performing conversational language models (LMs) has spurred a shift from classic natural language processing (NLP) benchmarks to expensive, time-consuming and noisy human evaluations - yet the relationship between these two evaluation strategies remains hazy.

ARC +1
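
The comparison this paper studies can be made concrete in a few lines: correlate per-model benchmark scores with per-model human-evaluation scores. The sketch below is a minimal illustration with hypothetical numbers, not data from the paper.

```python
# Minimal sketch: correlate per-model NLP benchmark scores with per-model
# human-evaluation scores. All numbers are hypothetical placeholders.
import numpy as np
from scipy.stats import spearmanr

benchmark_scores = np.array([52.1, 61.3, 66.0, 70.2, 74.5])   # e.g., aggregate benchmark accuracy
human_eval_scores = np.array([0.31, 0.44, 0.42, 0.58, 0.63])  # e.g., human pairwise win rate

rho, pval = spearmanr(benchmark_scores, human_eval_scores)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```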

How Do Large Language Monkeys Get Their Power (Laws)?

no code implementations • 24 Feb 2025 • Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, Sanmi Koyejo

Recent research across mathematical problem solving, proof assistant programming and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- then the negative log of the average success rate scales as a power law in the number of attempts.

Language Modeling +1
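
The power-law claim is easy to simulate. The sketch below assumes an illustrative Beta distribution over per-task single-attempt success probabilities (the distributional choice is an assumption, not the paper's), computes the aggregate success rate at k attempts, and fits a line to its negative log on log-log axes.

```python
# Simulate pass@k scaling: per-task success probability p_i, success at k
# attempts = 1 - (1 - p_i)^k, aggregated over tasks. If -log(mean success)
# follows a power law in k, then log(-log(...)) is linear in log k.
import numpy as np

rng = np.random.default_rng(0)
p = rng.beta(0.15, 3.0, size=10_000)   # illustrative distribution with mass near zero
ks = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024])

pass_at_k = np.array([(1.0 - (1.0 - p) ** k).mean() for k in ks])
neg_log = -np.log(pass_at_k)

slope, intercept = np.polyfit(np.log(ks), np.log(neg_log), 1)
print(f"fitted power-law exponent ≈ {slope:.2f}")
```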

Best-of-N Jailbreaking

1 code implementation • 4 Dec 2024 • John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma

We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts.
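
For intuition, here is one of the simple augmentations the paper samples (random capitalization), in isolation; the full method resamples such augmentations up to N times per prompt. This is an illustrative fragment, not the paper's implementation.

```python
# One BoN-style prompt augmentation: randomly flip the case of each
# character. BoN Jailbreaking repeatedly samples augmentations like this.
import random

def augment(prompt: str, rng: random.Random) -> str:
    return "".join(ch.upper() if rng.random() < 0.5 else ch.lower() for ch in prompt)

rng = random.Random(0)
print([augment("an example prompt", rng) for _ in range(3)])
```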

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

no code implementations • 3 Dec 2024 • Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez

Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem.

ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment

no code implementations • 23 Oct 2024 • Elyas Obbad, Iddah Mlauzi, Brando Miranda, Rylan Schaeffer, Kamal Obbad, Suhana Bedi, Sanmi Koyejo

Data selection is crucial for optimizing language model (LM) performance on specific tasks, yet most existing methods fail to effectively consider the target task distribution.

Code Generation Domain Adaptation
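
A minimal sketch of compression-based alignment in ZIP-FIT's spirit: rank candidate training examples by normalized compression distance to target-task text using gzip. The exact scoring function here is an assumption, not the paper's implementation.

```python
# Rank candidates by gzip-based normalized compression distance (NCD)
# to a target-task example; lower NCD = better aligned.
import gzip

def clen(s: str) -> int:
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

target = "def add(a, b):\n    return a + b"
candidates = [
    "def mul(a, b):\n    return a * b",
    "The weather was pleasant throughout the spring.",
]
print(sorted(candidates, key=lambda s: ncd(s, target))[0])  # code-like candidate ranks first
```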

Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World

no code implementations • 22 Oct 2024 • Joshua Kazdan, Rylan Schaeffer, Apratim Dey, Matthias Gerstgrasser, Rafael Rafailov, David L. Donoho, Sanmi Koyejo

Others see collapse as avoidable; in an "accumulate" scenario, a sequence of models is trained, but each training run uses all real and synthetic data generated so far.
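
The contrast between the two regimes fits in a few lines; train() and generate() below are hypothetical stand-ins.

```python
# "Replace": each generation trains only on the previous generation's
# synthetic output. "Accumulate": each generation trains on all real and
# synthetic data generated so far.
def train(data): return f"model_trained_on_{len(data)}_examples"
def generate(model, n): return [f"synthetic_{model}_{i}" for i in range(n)]

real = ["real_0", "real_1", "real_2"]

data = list(real)                       # replace regime
for _ in range(3):
    model = train(data)
    data = generate(model, len(real))   # discard all prior data

data = list(real)                       # accumulate regime
for _ in range(3):
    model = train(data)
    data += generate(model, len(real))  # keep everything seen so far
```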

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

no code implementations • 21 Jul 2024 • Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez

These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.

Instruction Following Language Modelling +1

Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Frontier AI Models

no code implementations • 20 Jun 2024 • Sunny Duan, Mikail Khona, Abhiram Iyer, Rylan Schaeffer, Ila R Fiete

Frontier AI systems are making transformative impacts across society, but such benefits are not without costs: models trained on web-scale datasets containing personal and private data raise profound concerns about data privacy and security.

Diagnostic Memorization

In-Context Learning of Energy Functions

no code implementations • 18 Jun 2024 • Rylan Schaeffer, Mikail Khona, Sanmi Koyejo

In-context learning is a powerful capability of certain machine learning models that arguably underpins the success of today's frontier AI models.

In-Context Learning Language Modeling +1

Quantifying Variance in Evaluation Benchmarks

no code implementations • 14 Jun 2024 • Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, Dieuwke Hupkes

Evaluation benchmarks are the cornerstone of measuring capabilities of large language models (LLMs), as well as driving progress in said capabilities.

MMLU
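
One simple estimator of the variance in question is a bootstrap over evaluation items; the per-item correctness vector below is a hypothetical placeholder, not data from the paper.

```python
# Bootstrap the standard error of a benchmark accuracy over items.
import numpy as np

rng = np.random.default_rng(0)
correct = rng.random(1_000) < 0.62   # hypothetical per-item 0/1 outcomes

boot = np.array([
    rng.choice(correct, size=correct.size, replace=True).mean()
    for _ in range(2_000)
])
print(f"accuracy = {correct.mean():.3f} ± {boot.std():.3f} (bootstrap SE)")
```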

Towards an Improved Understanding and Utilization of Maximum Manifold Capacity Representations

no code implementations • 13 Jun 2024 • Rylan Schaeffer, Victor Lecomte, Dhruv Bhandarkar Pai, Andres Carranza, Berivan Isik, Alyssa Unell, Mikail Khona, Thomas Yerxa, Yann Lecun, SueYeon Chung, Andrey Gromov, Ravid Shwartz-Ziv, Sanmi Koyejo

We then leverage tools from information theory to show that such embeddings maximize a well-known lower bound on mutual information between views, thereby connecting the geometric perspective of MMCR to the information-theoretic perspective commonly discussed in MVSSL.

Self-Supervised Learning
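
One well-known lower bound of this kind in MVSSL is the InfoNCE bound (whether this is the precise bound invoked in the paper is an assumption here): with K paired views and critic f,

```latex
I(z_1; z_2) \;\ge\; \log K - \mathcal{L}_{\mathrm{InfoNCE}},
\quad
\mathcal{L}_{\mathrm{InfoNCE}}
= -\,\mathbb{E}\!\left[\log
  \frac{e^{f(z_1^{(i)},\, z_2^{(i)})}}
       {\tfrac{1}{K}\sum_{j=1}^{K} e^{f(z_1^{(i)},\, z_2^{(j)})}}\right].
```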

Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

no code implementations • 6 Jun 2024 • Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo

We then reveal the mechanism causing this degradation: downstream metrics require comparing the correct choice against a small number of specific incorrect choices, meaning accurately predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct choice with scale, but also how probability mass fluctuates on specific incorrect choices with scale.

Multiple-choice Question Answering
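
A toy example of the stated mechanism: multiple-choice accuracy compares the correct option's log-likelihood against the specific incorrect options, so shifting probability mass among the wrong options can flip the metric even while the correct option's probability stays fixed. Numbers below are hypothetical.

```python
# Same correct-choice log-likelihood, different wrong-choice distributions,
# different accuracy outcomes.
import numpy as np

logp_correct = -1.2
scenarios = {
    "A": np.array([-1.5, -2.0, -2.4]),  # wrong-choice log-likelihoods
    "B": np.array([-1.0, -2.9, -3.1]),  # mass shifted among wrong choices
}
for name, logp_wrong in scenarios.items():
    chosen = logp_correct > logp_wrong.max()
    print(f"scenario {name}: correct option selected = {chosen}")
```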

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

no code implementations • 1 Apr 2024 • Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo

The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs?

Image Generation

Bridging Associative Memory and Probabilistic Modeling

no code implementations • 15 Feb 2024 • Rylan Schaeffer, Nika Zahedi, Mikail Khona, Dhruv Pai, Sang Truong, Yilun Du, Mitchell Ostrow, Sarthak Chandra, Andres Carranza, Ila Rani Fiete, Andrey Gromov, Sanmi Koyejo

Based on the observation that associative memory's energy functions can be seen as probabilistic modeling's negative log likelihoods, we build a bridge between the two that enables useful flow of ideas in both directions.

In-Context Learning
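
The observation in one line: for a Boltzmann/Gibbs distribution with energy function E, the negative log likelihood is the energy up to a constant,

```latex
p(x) = \frac{e^{-E(x)}}{Z}, \qquad Z = \sum_{x'} e^{-E(x')}
\quad\Longrightarrow\quad
-\log p(x) = E(x) + \log Z .
```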

Disentangling Fact from Grid Cell Fiction in Trained Deep Path Integrators

no code implementations • 6 Dec 2023 • Rylan Schaeffer, Mikail Khona, Sanmi Koyejo, Ila Rani Fiete

Work on deep learning-based models of grid cells suggests that grid cells generically and robustly arise from optimizing networks to path integrate, i.e., track one's spatial position by integrating self-velocity signals.
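
Path integration here means recovering position by integrating self-velocity from a known start:

```latex
\mathbf{x}(t) = \mathbf{x}(0) + \int_0^{t} \mathbf{v}(\tau)\, d\tau .
```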

What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes

no code implementations • 5 Dec 2023 • Victor Lecomte, Kushal Thaman, Rylan Schaeffer, Naomi Bashkansky, Trevor Chow, Sanmi Koyejo

Using a combination of theory and experiments, we show that incidental polysemanticity can arise due to multiple reasons including regularization and neural noise; this incidental polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap.
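
A toy illustration of the by-chance overlap at initialization (dimensions and threshold are illustrative assumptions, not the paper's setup):

```python
# Count neurons whose random initial weights are already large on two or
# more features; training dynamics can then strengthen such overlap.
import numpy as np

rng = np.random.default_rng(0)
n_neurons = n_features = 64
W = rng.normal(0.0, 1.0 / np.sqrt(n_features), size=(n_neurons, n_features))

threshold = 2.0 / np.sqrt(n_features)             # "strong" = above two std devs
overlapping = (np.abs(W) > threshold).sum(axis=1) >= 2
print(f"{overlapping.sum()}/{n_neurons} neurons start with >=2 strong features")
```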

Testing Assumptions Underlying a Unified Theory for the Origin of Grid Cells

no code implementations • 27 Nov 2023 • Rylan Schaeffer, Mikail Khona, Adrian Bertagnoli, Sanmi Koyejo, Ila Rani Fiete

At both the population and single-cell levels, we find evidence suggesting that neither assumption is likely to hold in biological neural representations.

Pretraining on the Test Set Is All You Need

no code implementations • 13 Sep 2023 • Rylan Schaeffer

Inspired by recent work demonstrating the promise of smaller Transformer-based language models pretrained on carefully curated data, we supercharge such approaches by investing heavily in curating a novel, high quality, non-synthetic data mixture based solely on evaluation benchmarks.

FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation

no code implementations • 20 Jul 2023 • Dhruv Pai, Andres Carranza, Rylan Schaeffer, Arnuv Tandon, Sanmi Koyejo

We present FACADE, a novel probabilistic and geometric framework designed for unsupervised mechanistic anomaly detection in deep neural networks.

Anomaly Detection

Deceptive Alignment Monitoring

no code implementations • 20 Jul 2023 • Andres Carranza, Dhruv Pai, Rylan Schaeffer, Arnuv Tandon, Sanmi Koyejo

As the capabilities of large machine learning models continue to grow, and as the autonomy afforded to such models continues to expand, the spectre of a new adversary looms: the models themselves.

Safety Alignment

Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting

no code implementations • 20 Jul 2023 • Rylan Schaeffer, Kateryna Pistunova, Samar Khanna, Sarthak Consul, Sanmi Koyejo

We find that logically invalid reasoning prompts do indeed achieve similar performance gains on BBH tasks as logically valid reasoning prompts.

Language Modeling +1

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

no code implementations • NeurIPS 2023 • Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li

Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance -- where mistakes can be costly.

Adversarial Robustness Ethics +1

Are Emergent Abilities of Large Language Models a Mirage?

no code implementations • NeurIPS 2023 • Rylan Schaeffer, Brando Miranda, Sanmi Koyejo

Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models.
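
The paper's argument, as commonly summarized, is that apparent emergence can be an artifact of the researcher's choice of nonlinear or discontinuous metrics. A sketch under assumed numbers: per-token accuracy that improves smoothly with scale looks abrupt under exact match over many tokens.

```python
# Smooth per-token accuracy vs. sharp-looking exact match over 100 tokens.
# The functional form and constants are illustrative assumptions.
import numpy as np

params = np.logspace(7, 12, 6)                    # hypothetical model sizes
per_token_acc = np.exp(-8.0 * params ** -0.25)    # smooth improvement with scale
exact_match = per_token_acc ** 100                # nonlinear metric

for n, p, em in zip(params, per_token_acc, exact_match):
    print(f"{n:.0e} params: per-token {p:.3f}, 100-token exact match {em:.5f}")
```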

Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle

1 code implementation • 24 Mar 2023 • Rylan Schaeffer, Mikail Khona, Zachary Robertson, Akhilan Boopathy, Kateryna Pistunova, Jason W. Rocks, Ila Rani Fiete, Oluwasanmi Koyejo

Double descent is a surprising phenomenon in machine learning, in which as the number of model parameters grows relative to the number of data, test error drops as models grow ever larger into the highly overparameterized (data undersampled) regime.

Learning Theory regression
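
A minimal, self-contained demonstration of the phenomenon with min-norm linear regression on random features: test error should peak near the interpolation threshold (d ≈ n_train) and fall again as d grows past it. Sizes and noise level are illustrative.

```python
# Double descent in least squares: vary the number of features d used by
# a min-norm linear fit and watch test error spike near d = n_train.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_max = 40, 500, 160
X_tr = rng.normal(size=(n_train, d_max))
X_te = rng.normal(size=(n_test, d_max))
w_true = rng.normal(size=d_max) / np.sqrt(d_max)
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n_train)
y_te = X_te @ w_true

for d in (10, 20, 35, 40, 45, 60, 100, 160):
    w_hat = np.linalg.pinv(X_tr[:, :d]) @ y_tr    # min-norm least-squares solution
    mse = np.mean((X_te[:, :d] @ w_hat - y_te) ** 2)
    print(f"d = {d:3d}   test MSE = {mse:8.3f}")
```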

Streaming Inference for Infinite Non-Stationary Clustering

no code implementations • 2 May 2022 • Rylan Schaeffer, Gabrielle Kaili-May Liu, Yilun Du, Scott Linderman, Ila Rani Fiete

Learning from a continuous stream of non-stationary data in an unsupervised manner is arguably one of the most common and most challenging settings facing intelligent agents.

Clustering Variational Inference

An Algorithmic Theory of Metacognition in Minds and Machines

no code implementations • 5 Nov 2021 • Rylan Schaeffer

To the machine learning community, our proposed theory creates a novel interaction between the Actor and Critic in Actor-Critic agents and notes a novel connection between RL and Bayesian Optimization.

Bayesian Optimization Reinforcement Learning (RL)
