Search Results for author: Anikait Singh

Found 17 papers, 7 papers with code

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

1 code implementation • 3 Mar 2025 • Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, Noah D. Goodman

In systematic experimentation with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen's performance.

Reinforcement Learning (RL)

FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users

no code implementations • 26 Feb 2025 • Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn

Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.

In-Context Learning • Meta-Learning +1
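
The method named in the title suggests a concrete picture: a DPO-style preference loss applied to pairs whose prompts are prefixed with a few of the same user's labeled examples, so the model meta-learns to personalize from that few-shot context. Below is a minimal sketch of that idea, assuming a standard DPO objective; the function and its arguments are hypothetical rather than the paper's actual training code.

```python
# Hedged sketch of a few-shot personalized preference loss (FSPO-style).
# Assumes log-probs were computed on inputs of the form
# (few-shot user preference examples + prompt); names are illustrative.
import torch.nn.functional as F

def fspo_style_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                    beta=0.1):
    """ref_* are log-probs from a frozen reference model on the same inputs."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()
```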

Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models

1 code implementation • 24 Feb 2025 • Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, Nick Haber

However, existing open math datasets either contain a small collection of high-quality, human-written problems or a large corpus of machine-generated problems of uncertain quality, forcing researchers to choose between quality and quantity.

GSM8K • Math +2

Personalized Preference Fine-tuning of Diffusion Models

no code implementations • CVPR 2025 • Meihua Dang, Anikait Singh, Linqi Zhou, Stefano Ermon, Jiaming Song

With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way, enabling generalization to unseen users.

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

no code implementations • 8 Jan 2025 • Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn

We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular CoT.

Synthetic Data Generation
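
One way to picture Meta-CoT is as a change in the training target: the sequence the model learns to produce includes the latent search that led to the final chain of thought, not just the polished chain itself. The sketch below illustrates such a data format; the tags and field names are invented for illustration and are not taken from the paper.

```python
# Hypothetical Meta-CoT-style training example: expose the exploration
# (search trace) that produced the final chain of thought.
def format_meta_cot_example(question, search_trace, final_cot, answer):
    return (
        f"Question: {question}\n"
        f"<meta_reasoning>\n{search_trace}\n</meta_reasoning>\n"
        f"<chain_of_thought>\n{final_cot}\n</chain_of_thought>\n"
        f"Answer: {answer}"
    )
```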

Test-Time Alignment via Hypothesis Reweighting

no code implementations • 11 Dec 2024 • Yoonho Lee, Jonathan Williams, Henrik Marklund, Archit Sharma, Eric Mitchell, Anikait Singh, Chelsea Finn

Large pretrained models often struggle with underspecified tasks -- situations where the training data does not fully define the desired behavior.
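
The title's mechanism, hypothesis reweighting, can be sketched as maintaining an ensemble of hypotheses (e.g., prediction heads) and reweighting them from a handful of labeled examples revealed at test time. The following is a Bayesian-flavored illustration under those assumptions, not the paper's actual procedure.

```python
# Sketch: reweight ensemble members by the likelihood they assign to a few
# test-time labels. Shapes and the update rule are illustrative assumptions.
import numpy as np

def reweight_hypotheses(head_probs, labels):
    """head_probs: (num_heads, num_examples, num_classes) class probabilities.
    labels: (num_examples,) integer labels revealed at test time."""
    idx = np.arange(len(labels))
    # Log-likelihood each head assigns to the revealed labels.
    log_w = np.log(head_probs[:, idx, labels] + 1e-12).sum(axis=1)
    w = np.exp(log_w - log_w.max())  # subtract max for numerical stability
    return w / w.sum()
```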

Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation

no code implementations • 3 Oct 2024 • Rohin Manvi, Anikait Singh, Stefano Ermon

We further demonstrate that 50-75% of samples can be pruned early in generation with minimal degradation in performance.

GSM8K • Math
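
The pruning result suggests a simple best-of-N variant: score partial generations mid-stream with the model's self-predicted quality signal and drop the weakest before spending their remaining decode budget. The sketch below assumes such a setup; generate_step and score_partial are hypothetical stand-ins for the token sampler and the paper's learned self-evaluation.

```python
# Sketch of best-of-N sampling with one mid-generation pruning step.
def best_of_n_with_pruning(generate_step, score_partial, prompt, n=16,
                           max_steps=256, prune_at=64, keep_frac=0.25):
    candidates = [prompt] * n  # n partial generations
    for step in range(max_steps):
        candidates = [generate_step(c) for c in candidates]
        if step == prune_at:
            # Keep the most promising fraction, mirroring the finding that
            # 50-75% of samples can be pruned early with little quality loss.
            candidates.sort(key=score_partial, reverse=True)
            candidates = candidates[: max(1, int(n * keep_frac))]
    return max(candidates, key=score_partial)
```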

D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning

no code implementations • 15 Aug 2024 • Rafael Rafailov, Kyle Hatch, Anikait Singh, Laura Smith, Aviral Kumar, Ilya Kostrikov, Philippe Hansen-Estruch, Victor Kolev, Philip Ball, Jiajun Wu, Chelsea Finn, Sergey Levine

However, evaluating progress on offline RL algorithms requires effective and challenging benchmarks that capture properties of real-world tasks, provide a range of task difficulties, and cover a range of challenges both in terms of the parameters of the domain (e.g., length of the horizon, sparsity of rewards) and the parameters of the data (e.g., narrow demonstration data or broad exploratory data).

Deep Reinforcement Learning • Offline RL +1

Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data

1 code implementation • 22 Apr 2024 • Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar

Our main finding is that, in general, approaches that use on-policy sampling or attempt to push down the likelihood on certain responses (i.e., employ a "negative gradient") outperform offline and maximum likelihood objectives.

Contrastive Learning • Reinforcement Learning (RL)
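
The "negative gradient" can be made concrete with a contrastive objective whose gradient raises the log-likelihood of preferred responses while explicitly lowering it on dispreferred (e.g., on-policy sampled) ones, in contrast to maximum-likelihood fine-tuning, which has only the positive term. The sketch below is one illustrative member of the family of objectives the paper analyzes, not its exact training code.

```python
# Hedged sketch of a contrastive preference loss with a negative gradient.
import torch.nn.functional as F

def contrastive_preference_loss(logp_pos, logp_neg, beta=0.1):
    """logp_pos / logp_neg: summed log-probs of chosen / rejected responses.
    d(loss)/d(logp_neg) > 0, so gradient descent pushes logp_neg down."""
    return -F.logsigmoid(beta * (logp_pos - logp_neg)).mean()
```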

Robotic Offline RL from Internet Videos via Value-Function Pre-Training

no code implementations • 22 Sep 2023 • Chethan Bhateja, Derek Guo, Dibya Ghosh, Anikait Singh, Manan Tomar, Quan Vuong, Yevgen Chebotar, Sergey Levine, Aviral Kumar

Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that perform better, act robustly, and generalize broadly.

Offline RL • Reinforcement Learning (RL)

Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning

3 code implementations • NeurIPS 2023 • Mitsuhiko Nakamoto, Yuexiang Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, Sergey Levine

Our approach, calibrated Q-learning (Cal-QL), accomplishes this by learning a conservative value function initialization that underestimates the value of the learned policy from offline data, while also being calibrated, in the sense that the learned Q-values are at a reasonable scale.

Offline RL • Q-Learning +1
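
The calibration idea can be sketched as a small change to a CQL-style regularizer: out-of-distribution Q-values are pushed down only while they exceed a reference value estimate (for instance, a Monte Carlo return of the behavior policy), so the learned values stop being pushed down once they reach a reasonable scale. The snippet below illustrates that under assumed shapes; it is not the paper's implementation.

```python
# Sketch of a Cal-QL-style calibrated conservative regularizer.
import torch

def cal_ql_regularizer(q_ood, q_data, v_ref):
    """q_ood: Q(s, a') on out-of-distribution actions; q_data: Q(s, a) on
    dataset actions; v_ref: reference value estimate for the same states."""
    # Clipping at v_ref halts the push-down once Q reaches the reference
    # scale -- the calibration term added on top of the usual CQL penalty.
    return torch.maximum(q_ood, v_ref).mean() - q_data.mean()
```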

Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints

no code implementations • 2 Nov 2022 • Anikait Singh, Aviral Kumar, Quan Vuong, Yevgen Chebotar, Sergey Levine

Both theoretically and empirically, we show that typical offline RL methods, which are based on distribution constraints, fail to learn from data with such non-uniform variability, due to the requirement to stay close to the behavior policy to the same extent across the state space.

Atari Games • Offline RL +2
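
The failure mode is easiest to see by contrasting the two constraint types on a discrete action distribution: a distribution constraint penalizes any deviation from the behavior policy, even onto well-supported actions, while a support constraint only penalizes mass placed outside the data's support. The toy comparison below is illustrative, not the paper's method.

```python
# Toy contrast: distribution constraint vs. support constraint over discrete
# action probabilities pi (learned policy) and beta (behavior policy).
import numpy as np

def distribution_penalty(pi, beta):
    # KL(pi || beta): large whenever pi differs from beta, even if pi
    # concentrates on an action that beta takes often.
    return np.sum(pi * (np.log(pi + 1e-12) - np.log(beta + 1e-12)))

def support_penalty(pi, beta, eps=1e-3):
    # Only penalize probability placed on actions (nearly) outside support.
    return np.sum(pi[beta < eps])
```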

Pre-Training for Robots: Offline RL Enables Learning New Tasks from a Handful of Trials

1 code implementation • 11 Oct 2022 • Aviral Kumar, Anikait Singh, Frederik Ebert, Mitsuhiko Nakamoto, Yanlai Yang, Chelsea Finn, Sergey Levine

To our knowledge, PTR is the first RL method that succeeds at learning new tasks in a new domain on a real WidowX robot with as few as 10 task demonstrations, by effectively leveraging an existing dataset of diverse multi-task robot data collected in a variety of toy kitchens.

Offline RL • Q-Learning +1

When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning?

no code implementations • 12 Apr 2022 • Aviral Kumar, Joey Hong, Anikait Singh, Sergey Levine

To answer this question, we characterize the properties of environments that allow offline RL methods to perform better than BC methods, even when only provided with expert data.

Atari Games • Diagnostic +5

Should I Run Offline Reinforcement Learning or Behavioral Cloning?

no code implementations • ICLR 2022 • Aviral Kumar, Joey Hong, Anikait Singh, Sergey Levine

In this paper, our goal is to characterize environments and dataset compositions where offline RL leads to better performance than BC.

Atari Games • Diagnostic +5

A Workflow for Offline Model-Free Robotic Reinforcement Learning

1 code implementation • 22 Sep 2021 • Aviral Kumar, Anikait Singh, Stephen Tian, Chelsea Finn, Sergey Levine

To this end, we devise a set of metrics and conditions that can be tracked over the course of offline training, and can inform the practitioner about how the algorithm and model architecture should be adjusted to improve final performance.

Offline RL • reinforcement-learning +2
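
In practice such a workflow amounts to logging simple statistics during offline training and reacting to them. The sketch below shows the flavor of this kind of check using average dataset Q-values; the thresholds and suggested adjustments are illustrative assumptions, not the paper's exact metrics or conditions.

```python
# Illustrative health check for offline RL training based on logged Q-values.
def check_q_value_trend(mean_q_log, lower=-1e3, upper=1e3):
    """mean_q_log: mean Q-value on dataset transitions, one entry per eval."""
    latest = mean_q_log[-1]
    if latest > upper:
        return "Q-values diverging upward: increase conservatism"
    if latest < lower:
        return "Q-values collapsing: reduce conservatism or check targets"
    return "Q-values in range: continue training"
```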
