1 code implementation • 12 Jun 2025 • Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, Yulia Tsvetkov, Hannaneh Hajishirzi, Pang Wei Koh, Luke Zettlemoyer
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer.
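For intuition, here is a minimal sketch (not the paper's code) contrasting a standard verifiable reward with two spurious alternatives of the kind studied; `extract_answer` is a toy parser assumed for illustration.

```python
import random
import re

def extract_answer(response: str) -> str:
    """Pull the final boxed answer out of a model response (toy parser)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return match.group(1).strip() if match else ""

def verifiable_reward(response: str, gold: str) -> float:
    """Standard RLVR reward: 1 iff the extracted answer matches the gold answer."""
    return 1.0 if extract_answer(response) == gold else 0.0

def random_spurious_reward(response: str, gold: str) -> float:
    """Spurious reward with zero correlation to correctness: a coin flip."""
    return float(random.random() < 0.5)

def format_only_reward(response: str, gold: str) -> float:
    """Spurious reward for formatting alone: ignores the answer's value."""
    return 1.0 if "\\boxed{" in response else 0.0
```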
no code implementations • 10 Jun 2025 • Hao Hu, Xinqi Wang, Simon Shaolei Du
We introduce a novel task of clustering trajectories from offline reinforcement learning (RL) datasets, where each cluster center represents the policy that generated its trajectories.
1 code implementation • 9 Jun 2025 • Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, Natasha Jaques
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities.
no code implementations • 21 May 2025 • Siting Li, Xiang Gao, Simon Shaolei Du
To evaluate current retrievers on handling attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest.
1 code implementation • 29 Apr 2025 • Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen
We also show the critical role of promoting exploration (e.g., by adding an entropy loss with an appropriate coefficient) in 1-shot RLVR training.
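A minimal sketch of what such an entropy term looks like in a policy-gradient loss (PyTorch; the coefficient value is illustrative, not the paper's setting):

```python
import torch

def pg_loss_with_entropy(logits, actions, advantages, entropy_coef=0.01):
    """Policy-gradient loss with an entropy bonus to promote exploration.

    logits:     (batch, vocab) unnormalized scores from the policy
    actions:    (batch,) sampled token ids (long tensor)
    advantages: (batch,) advantage estimates
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    action_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(advantages * action_log_probs).mean()
    # Policy entropy; subtracting it from the loss rewards more uniform,
    # exploratory action distributions.
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    return pg_loss - entropy_coef * entropy
```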
no code implementations • 20 Apr 2025 • Avinandan Bose, Zhihan Xiong, Yuejie Chi, Simon Shaolei Du, Lin Xiao, Maryam Fazel
Personalizing large language models (LLMs) to accommodate diverse user preferences is essential for enhancing alignment and user satisfaction.
no code implementations • CVPR 2025 • Yiping Wang, Xuehai He, Kuan Wang, Luyao Ma, Jianwei Yang, Shuohang Wang, Simon Shaolei Du, Yelong Shen
However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, which is foreseeably an essential capability for future long-video generation scenarios.
no code implementations • 13 Dec 2024 • Avinandan Bose, Zhihan Xiong, Aadirupa Saha, Simon Shaolei Du, Maryam Fazel
Our results show that hybrid RLHF achieves better sample efficiency than purely offline or purely online exploration.
no code implementations • 7 Nov 2024 • Siting Li, Pang Wei Koh, Simon Shaolei Du
Recent research suggests that the failures of Vision-Language Models (VLMs) at visual reasoning often stem from erroneous agreements -- when semantically distinct images are ambiguously encoded by the CLIP image encoder into embeddings with high cosine similarity.
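As a concrete illustration of such an erroneous agreement, the sketch below measures how closely the CLIP image encoder embeds two images; a high cosine similarity between semantically distinct images is the failure mode described. It assumes the Hugging Face transformers CLIP API, and the checkpoint and image paths are illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_cosine_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between CLIP embeddings of two images."""
    images = [Image.open(path_a), Image.open(path_b)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()

# A value near 1 for two semantically distinct images signals an
# "erroneous agreement" in the sense used above.
```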
no code implementations • 7 Oct 2024 • Xiyu Zhai, Runlong Zhou, Liao Zhang, Simon Shaolei Du
Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks, including programming language understanding and generation.
no code implementations • 2 Jul 2024 • Yifang Chen, Shuohang Wang, ZiYi Yang, Hiteshi Sharma, Nikos Karampatziakis, Donghan Yu, Kevin Jamieson, Simon Shaolei Du, Yelong Shen
Reinforcement learning with human feedback (RLHF), as a widely adopted approach in current large language model pipelines, is bottlenecked by the size of human preference data.
2 code implementations • 29 May 2024 • Yiping Wang, Yifang Chen, Wendan Yan, Alex Fang, Wenjing Zhou, Kevin Jamieson, Simon Shaolei Du
Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric).
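CLIPScore, as commonly computed for data selection, is the cosine similarity between an image and its caption in CLIP embedding space (the original CLIPScore paper additionally rescales it). A minimal sketch assuming the Hugging Face CLIP API, with the checkpoint and the filtering threshold as illustrative choices:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between an image and its caption in CLIP space."""
    inputs = processor(text=[caption], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

# Keep only image-text pairs above a quality threshold (value illustrative):
# kept = [(im, cap) for im, cap in pairs if clip_score(im, cap) > 0.28]
```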
no code implementations • 19 Feb 2024 • Avinandan Bose, Simon Shaolei Du, Maryam Fazel
We study the problem of representation transfer in offline Reinforcement Learning (RL), where a learner has access to episodic data from a number of source tasks collected a priori, and aims to learn a shared representation to be used in finding a good policy for a target task.
2 code implementations • 3 Feb 2024 • Yiping Wang, Yifang Chen, Wendan Yan, Kevin Jamieson, Simon Shaolei Du
In recent years, data selection has emerged as a core issue for large-scale vision-language model pretraining, especially on noisy web-curated datasets.
1 code implementation • 30 Oct 2023 • Zhaoyi Zhou, Chuning Zhu, Runlong Zhou, Qiwen Cui, Abhishek Gupta, Simon Shaolei Du
Off-policy dynamic programming (DP) techniques such as Q-learning have proven to be important in sequential decision-making problems.
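For reference, the tabular Q-learning update that makes these methods off-policy, as a minimal numpy sketch with illustrative step size and discount:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One off-policy Q-learning update:
    Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)].
    Bootstrapping from the greedy next action (the max) rather than the
    action the behavior policy actually took is what makes it off-policy.
    Q is a (num_states, num_actions) array."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```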
no code implementations • 28 Sep 2023 • Jiarui Yao, Simon Shaolei Du
Reinforcement learning (RL), and deep RL in particular, has attracted growing attention from the research community.
1 code implementation • 16 Jun 2023 • Jifan Zhang, Yifang Chen, Gregory Canal, Stephen Mussmann, Arnav M. Das, Gantavya Bhatt, Yinglun Zhu, Jeffrey Bilmes, Simon Shaolei Du, Kevin Jamieson, Robert D Nowak
Labeled data are critical to modern machine learning applications, but obtaining labels can be expensive.
no code implementations • 13 Dec 2021 • Shusheng Xu, Yancheng Liang, Yunfei Li, Simon Shaolei Du, Yi Wu
A ubiquitous requirement in many practical reinforcement learning (RL) applications, including medical treatment, recommendation systems, education, and robotics, is that the deployed policy that actually interacts with the environment cannot change frequently.
no code implementations • 1 Jan 2021 • Shusheng Xu, Simon Shaolei Du, Yi Wu
We initiate the study of deep reinforcement learning problems that require low switching cost, i.e., a small number of policy switches during training.
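One standard way to keep the switch count low is to redeploy the policy only at exponentially spaced checkpoints, giving O(log T) switches over T environment steps. The sketch below assumes a hypothetical gym-like `agent`/`env` API and is an illustrative scheme, not necessarily the paper's algorithm.

```python
def train_with_low_switching_cost(agent, env, total_steps):
    """Redeploy the policy only at steps 1, 2, 4, 8, ..., so the number
    of policy switches grows as O(log total_steps)."""
    deployed = agent.snapshot_policy()   # frozen copy that interacts with env
    next_switch, step = 1, 0
    obs = env.reset()
    while step < total_steps:
        action = deployed.act(obs)
        next_obs, reward, done, _ = env.step(action)
        agent.store(obs, action, reward, next_obs, done)
        agent.update()                   # learning continues off-policy throughout
        obs = env.reset() if done else next_obs
        step += 1
        if step >= next_switch:          # switch only when the step count doubles
            deployed = agent.snapshot_policy()
            next_switch *= 2
    return deployed
```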
no code implementations • NeurIPS 2017 • Simon Shaolei Du, Jayanth Koushik, Aarti Singh, Barnabas Poczos
We consider the Hypothesis Transfer Learning (HTL) problem where one incorporates a hypothesis trained on the source domain into the learning procedure of the target domain.
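A classic instantiation of HTL is biased regularization, which shrinks the target model toward the source hypothesis rather than toward zero; a minimal numpy sketch (not necessarily the paper's estimator):

```python
import numpy as np

def htl_ridge(X, y, w_source, lam=1.0):
    """Hypothesis transfer via biased ridge regression: solve
        min_w ||Xw - y||^2 + lam * ||w - w_source||^2,
    whose closed form is w = (X^T X + lam I)^{-1} (X^T y + lam w_source).
    With lam -> infinity we recover the source hypothesis; with lam -> 0
    we ignore it and fit the target data alone."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * w_source)
```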