Search Results for author: Rafael Rafailov

Found 32 papers, 16 papers with code

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

no code implementations 8 Jan 2025 Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn

We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular CoT.

Synthetic Data Generation

Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World

no code implementations 22 Oct 2024 Joshua Kazdan, Rylan Schaeffer, Apratim Dey, Matthias Gerstgrasser, Rafael Rafailov, David L. Donoho, Sanmi Koyejo

Others see collapse as avoidable; in an 'accumulate' scenario, a sequence of models is trained, but each training uses all real and synthetic data generated so far.
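To make the two data regimes concrete, here is a minimal sketch (not from the paper) contrasting a "replace" pipeline, which trains each generation only on the previous generation's synthetic output, with the "accumulate" scenario quoted above; `train_model` and `generate_synthetic` are hypothetical stand-ins for a real training loop and a sampling procedure.

```python
# Minimal sketch (not from the paper) contrasting the "replace" and "accumulate"
# data regimes. train_model and generate_synthetic are hypothetical placeholders.

def replace_regime(real_data, train_model, generate_synthetic, n_generations):
    """Each generation trains only on the previous generation's synthetic output."""
    data, model = list(real_data), None
    for _ in range(n_generations):
        model = train_model(data)
        data = generate_synthetic(model, n=len(real_data))  # earlier data is discarded
    return model

def accumulate_regime(real_data, train_model, generate_synthetic, n_generations):
    """Each generation trains on all real and synthetic data generated so far."""
    data, model = list(real_data), None
    for _ in range(n_generations):
        model = train_model(data)
        data = data + generate_synthetic(model, n=len(real_data))  # nothing is discarded
    return model
```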

Generative Reward Models

no code implementations 2 Oct 2024 Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, Alon Albalak

We introduce GenRM, an iterative algorithm that trains an LLM on self-generated reasoning traces, leading to synthetic preference labels matching human preference judgments.

Reinforcement Learning
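As a rough illustration of the kind of iterative self-training loop described in the entry above (not the paper's actual implementation), the sketch below generates a reasoning trace and verdict for each preference pair, keeps traces whose verdict matches a reference label, and fine-tunes on them; `model.generate`, `judge_template`, and `finetune` are hypothetical helpers.

```python
# Hedged sketch of an iterative "reason, then judge" training loop in the spirit of a
# generative reward model; every helper here (model.generate, finetune, judge_template)
# is a hypothetical placeholder, not the paper's API.

def train_generative_rm(model, pairs, judge_template, finetune, n_rounds=3):
    """pairs: iterable of (prompt, response_a, response_b, reference_label) tuples."""
    for _ in range(n_rounds):
        kept_traces = []
        for prompt, a, b, reference_label in pairs:
            # Generate a chain-of-thought judgment that ends in a verdict ("A" or "B").
            trace = model.generate(judge_template.format(prompt=prompt, a=a, b=b))
            verdict = "A" if trace.strip().endswith("A") else "B"  # toy verdict parsing
            if verdict == reference_label:
                kept_traces.append(trace)  # keep only traces whose label checks out
        model = finetune(model, kept_traces)  # train on the self-generated traces
    return model
```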

D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning

no code implementations 15 Aug 2024 Rafael Rafailov, Kyle Hatch, Anikait Singh, Laura Smith, Aviral Kumar, Ilya Kostrikov, Philippe Hansen-Estruch, Victor Kolev, Philip Ball, Jiajun Wu, Chelsea Finn, Sergey Levine

However, evaluating progress on offline RL algorithms requires effective and challenging benchmarks that capture properties of real-world tasks, provide a range of task difficulties, and cover a range of challenges both in terms of the parameters of the domain (e.g., length of the horizon, sparsity of rewards) and the parameters of the data (e.g., narrow demonstration data or broad exploratory data).

Deep Reinforcement Learning Offline RL +1

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

2 code implementations 13 Aug 2024 Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov

Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning, yet their application in agentic, multi-step reasoning within interactive environments remains a difficult challenge.

Decision Making

PERSONA: A Reproducible Testbed for Pluralistic Alignment

no code implementations 24 Jul 2024 Louis Castricato, Nathan Lile, Rafael Rafailov, Jan-Philipp Fränken, Chelsea Finn

The rapid advancement of language models (LMs) necessitates robust alignment with diverse user values.

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

1 code implementation 5 Jul 2024 Zhaorun Chen, Yichao Du, Zichen Wen, Yiyang Zhou, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, Canyu Chen, Qinghao Ye, Zhihong Zhu, Yuqing Zhang, Jiawei Zhou, Zhuokai Zhao, Rafael Rafailov, Chelsea Finn, Huaxiu Yao

Compared with open-source VLMs, smaller-sized scoring models can provide better feedback regarding text-image alignment and image quality, while VLMs provide more accurate feedback regarding safety and generation bias due to their stronger reasoning capabilities.

Hallucination Text-to-Image Generation

OpenVLA: An Open-Source Vision-Language-Action Model

2 code implementations 13 Jun 2024 Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn

Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control.

Imitation Learning Language Modelling +1

Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

no code implementations 5 Jun 2024 Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, Scott Niekum

Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs); however, it is often a complex and brittle process.

Reinforcement Learning (RL)

Scalable Ensembling For Mitigating Reward Overoptimisation

no code implementations 3 Jun 2024 Ahmed M. Ahmed, Rafael Rafailov, Stepan Sharkov, Xuechen Li, Sanmi Koyejo

Reinforcement Learning from Human Feedback (RLHF) has enabled significant advancements within language modeling for powerful, instruction-following models.

Instruction Following Language Modeling +3

Efficient Imitation Learning with Conservative World Models

no code implementations 21 May 2024 Victor Kolev, Rafael Rafailov, Kyle Hatch, Jiajun Wu, Chelsea Finn

One approach to this issue is to learn a world model of the environment, and use synthetic data for policy training.

Imitation Learning Offline RL

Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data

1 code implementation 22 Apr 2024 Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar

Our main finding is that, in general, approaches that use on-policy sampling or attempt to push down the likelihood on certain responses (i.e., employ a "negative gradient") outperform offline and maximum likelihood objectives.

Contrastive Learning Reinforcement Learning (RL)

From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function

no code implementations 18 Apr 2024 Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn

Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm.

Language Modeling Language Modelling +2
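A minimal sketch of the token-level quantity this result revolves around: the β-scaled log-probability ratio between the trained policy and the frozen reference model at each token, whose sum over a response recovers the bandit-level implicit reward. Tensor shapes and model interfaces here are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

# Hedged sketch: per-token "implicit rewards" as beta-scaled log-probability ratios
# between the trained policy and the reference model. Both logit tensors are assumed
# to have shape [batch, seq_len, vocab]; token_ids has shape [batch, seq_len].

def per_token_implicit_reward(policy_logits, ref_logits, token_ids, beta=0.1):
    logp_policy = F.log_softmax(policy_logits, dim=-1)
    logp_ref = F.log_softmax(ref_logits, dim=-1)
    taken = token_ids.unsqueeze(-1)
    lp_pol = logp_policy.gather(-1, taken).squeeze(-1)   # log pi(a_t | s_t)
    lp_ref = logp_ref.gather(-1, taken).squeeze(-1)      # log pi_ref(a_t | s_t)
    token_rewards = beta * (lp_pol - lp_ref)             # token-level credit
    sequence_reward = token_rewards.sum(dim=-1)          # bandit-level implicit reward
    return token_rewards, sequence_reward
```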

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

no code implementations 1 Apr 2024 Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo

The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs?

Image Generation

Disentangling Length from Quality in Direct Preference Optimization

1 code implementation 28 Mar 2024 Ryan Park, Rafael Rafailov, Stefano Ermon, Chelsea Finn

A number of approaches have been developed to control those biases in the classical RLHF literature, but the problem remains relatively under-explored for Direct Alignment Algorithms such as Direct Preference Optimization (DPO).

Reinforcement Learning

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

1 code implementation 18 Feb 2024 Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao

This procedure is not perfect and can cause the model to hallucinate - provide answers that do not accurately reflect the image, even when the core LLM is highly factual and the vision backbone has sufficiently complete representations.

Hallucination Instruction Following +1

MOTO: Offline Pre-training to Online Fine-tuning for Model-based Robot Learning

no code implementations 6 Jan 2024 Rafael Rafailov, Kyle Hatch, Victor Kolev, John D. Martin, Mariano Phielipp, Chelsea Finn

We study the problem of offline pre-training and online fine-tuning for reinforcement learning from high-dimensional observations in the context of realistic robot tasks.

Offline RL Robot Manipulation

Diffusion Model Alignment Using Direct Preference Optimization

1 code implementation CVPR 2024 Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik

Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences.

Contrastive Preference Learning: Learning from Human Feedback without RL

1 code implementation 20 Oct 2023 Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh

Thus, learning a reward function from feedback is not only based on a flawed assumption of human preference, but also leads to unwieldy optimization challenges that stem from policy gradients or bootstrapping in the RL phase.

Reinforcement Learning (RL)

An Emulator for Fine-Tuning Large Language Models using Small Language Models

1 code implementation 19 Oct 2023 Eric Mitchell, Rafael Rafailov, Archit Sharma, Chelsea Finn, Christopher D. Manning

To aid in doing so, we introduce a novel technique for decoupling the knowledge and skills gained in these two stages, enabling a direct answer to the question, "What would happen if we combined the knowledge learned by a large model during pre-training with the knowledge learned by a small model during fine-tuning (or vice versa)?"

Instruction Following
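A hedged sketch of the per-token decoding rule behind the idea in the entry above: combine the large pre-trained model's log-probabilities with the behavioral delta between a small fine-tuned model and its small pre-trained counterpart. The shared-vocabulary assumption and function names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of emulated fine-tuning at decoding time: shift the large base model's
# next-token distribution by the (small fine-tuned - small base) delta. All three logit
# tensors are assumed to share one vocabulary (an assumption made for illustration).

def emulated_finetuning_logits(large_base_logits, small_ft_logits, small_base_logits):
    logp_large_base = F.log_softmax(large_base_logits, dim=-1)
    logp_small_ft = F.log_softmax(small_ft_logits, dim=-1)
    logp_small_base = F.log_softmax(small_base_logits, dim=-1)
    # "Large pre-training + small fine-tuning": add the fine-tuning delta to the big base model.
    combined = logp_large_base + (logp_small_ft - logp_small_base)
    return combined  # sample or greedy-decode the next token from softmax(combined)
```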

Contrastive Example-Based Control

1 code implementation 24 Jul 2023 Kyle Hatch, Benjamin Eysenbach, Rafael Rafailov, Tianhe Yu, Ruslan Salakhutdinov, Sergey Levine, Chelsea Finn

In this paper, we propose a method for offline, example-based control that learns an implicit model of multi-step transitions, rather than a reward function.

Offline RL

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

22 code implementations NeurIPS 2023 Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF).

Language Modeling Language Modelling +2
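For reference, a minimal PyTorch sketch of the published DPO objective: a logistic loss on the margin between implicit rewards, where each implicit reward is the β-scaled log-ratio of the policy to a frozen reference model on the whole response. Inputs are summed response log-probabilities; variable names are mine, not the authors' released code.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the DPO loss: optimize the policy directly on preference pairs,
# with no explicit reward model or RL loop. Each input is the summed log-probability
# of a full response under either the trainable policy or the frozen reference model.

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary logistic loss on the reward margin: prefer chosen over rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```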

Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback

no code implementations 24 May 2023 Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, Christopher D. Manning

A trustworthy real-world prediction system should produce well-calibrated confidence scores; that is, its confidence in an answer should be indicative of the likelihood that the answer is correct, enabling deferral to an expert in cases of low-confidence predictions.

TriviaQA TruthfulQA +1

Vision-Based Manipulators Need to Also See from Their Hands

no code implementations ICLR 2022 Kyle Hsu, Moo Jin Kim, Rafael Rafailov, Jiajun Wu, Chelsea Finn

We study how the choice of visual perspective affects learning and generalization in the context of physical manipulation from raw sensor observations.

Out-of-Distribution Generalization

Visual Adversarial Imitation Learning using Variational Models

no code implementations NeurIPS 2021 Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, Chelsea Finn

We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions.

Deep Reinforcement Learning Imitation Learning +1

COMBO: Conservative Offline Model-Based Policy Optimization

4 code implementations NeurIPS 2021 Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, Chelsea Finn

We overcome this limitation by developing a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-action tuples generated via rollouts under the learned model.

Offline RL Uncertainty Quantification
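A hedged sketch of the conservative critic regularizer described in the entry above: Q-values are pushed down on state-action tuples drawn from model rollouts and up on tuples from the offline dataset, added to a standard TD error (computed only on the dataset batch here for brevity). `q_net` and the batch dictionaries are hypothetical placeholders.

```python
import torch

# Hedged sketch of a COMBO-style conservative critic loss: penalize Q on model-generated
# (potentially out-of-support) tuples, reward Q on dataset tuples, plus a TD error term.

def conservative_critic_loss(q_net, dataset_batch, rollout_batch, bellman_targets, beta=1.0):
    q_data = q_net(dataset_batch["obs"], dataset_batch["actions"])
    q_model = q_net(rollout_batch["obs"], rollout_batch["actions"])
    # Conservative term: push down values on model rollouts, up on real data.
    conservative = q_model.mean() - q_data.mean()
    # Standard TD error (shown on the dataset batch only, for brevity).
    bellman = ((q_data - bellman_targets) ** 2).mean()
    return beta * conservative + bellman
```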

Offline Reinforcement Learning from Images with Latent Space Models

1 code implementation 21 Dec 2020 Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, Chelsea Finn

In this work, we build on recent advances in model-based algorithms for offline RL, and extend them to high-dimensional visual observation spaces.

Offline RL reinforcement-learning +2

Offline Meta-Reinforcement Learning with Advantage Weighting

2 code implementations 13 Aug 2020 Eric Mitchell, Rafael Rafailov, Xue Bin Peng, Sergey Levine, Chelsea Finn

That is, in offline meta-RL, we meta-train on fixed, pre-collected data from several tasks in order to adapt to a new task with a very small amount (less than 5 trajectories) of data from the new task.

Machine Translation Meta-Learning +6
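A brief sketch of the advantage-weighted policy update the title above refers to: supervised log-likelihood on dataset actions, re-weighted by exponentiated advantage estimates so that better-than-average actions are imitated more strongly. The temperature, clipping value, and interface are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

# Hedged sketch of an advantage-weighted regression policy loss: behavior cloning
# re-weighted by exp(advantage / temperature), with weights clipped for stability.

def advantage_weighted_loss(log_probs, advantages, temperature=1.0, weight_clip=20.0):
    # log_probs: log pi(a|s) for dataset actions; advantages: A(s, a) estimates.
    weights = torch.clamp(torch.exp(advantages / temperature), max=weight_clip)
    return -(weights.detach() * log_probs).mean()
```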
