Search Results for author: Simon Shaolei Du

Found 20 papers, 7 papers with code

Spurious Rewards: Rethinking Training Signals in RLVR

1 code implementation12 Jun 2025 Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, Yulia Tsvetkov, Hannaneh Hajishirzi, Pang Wei Koh, Luke Zettlemoyer

We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer.

Math Mathematical Reasoning
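The contrast the abstract draws can be sketched in a few lines: a verifiable reward checks the model's final answer against a gold answer, while a "spurious" reward ignores correctness entirely. This is an illustrative toy comparison, not the paper's code; the function names are hypothetical.

```python
import random

def verifiable_reward(model_answer: str, gold_answer: str) -> float:
    """RLVR-style reward: 1.0 iff the model's final answer matches the gold answer."""
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0

def spurious_reward(model_answer: str, gold_answer: str, rng: random.Random) -> float:
    """A 'spurious' signal: a coin flip, uncorrelated with correctness."""
    return float(rng.random() < 0.5)

rng = random.Random(0)
print(verifiable_reward("42", "42"))      # 1.0
print(verifiable_reward("41", "42"))      # 0.0
print(spurious_reward("41", "42", rng))   # independent of the answer
```

The paper's finding is that training against signals like the second one can still elicit reasoning gains in certain models.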

Policy-Based Trajectory Clustering in Offline Reinforcement Learning

no code implementations10 Jun 2025 Hao Hu, Xinqi Wang, Simon Shaolei Du

We introduce a novel task of clustering trajectories from offline reinforcement learning (RL) datasets, where each cluster center represents the policy that generated its trajectories.

Clustering D4RL +6
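One way to make "each cluster center is a policy" concrete is a likelihood-based assignment step, as in the E-step of a k-means-style loop. The sketch below is an assumed formulation, not the paper's algorithm: tabular policies are row-stochastic (state x action) matrices, and each trajectory of (state, action) pairs is assigned to the policy under which its actions are most likely.

```python
import numpy as np

def assign_trajectories(trajectories, policies):
    """Assign each trajectory (list of (state, action) pairs) to the policy
    under which its actions have the highest total log-likelihood."""
    labels = []
    for traj in trajectories:
        loglik = [sum(np.log(pi[s, a] + 1e-12) for s, a in traj) for pi in policies]
        labels.append(int(np.argmax(loglik)))
    return labels

# Two toy tabular policies over 2 states x 2 actions (rows sum to 1).
pi0 = np.array([[0.9, 0.1], [0.9, 0.1]])   # prefers action 0
pi1 = np.array([[0.1, 0.9], [0.1, 0.9]])   # prefers action 1
trajs = [[(0, 0), (1, 0)], [(0, 1), (1, 1)]]
print(assign_trajectories(trajs, [pi0, pi1]))  # [0, 1]
```

A full clustering loop would alternate this assignment with re-estimating each policy from its assigned trajectories.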

Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

1 code implementation9 Jun 2025 Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, Natasha Jaques

Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities.

Multi-agent Reinforcement Learning Safety Alignment

Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval

no code implementations21 May 2025 Siting Li, Xiang Gao, Simon Shaolei Du

To evaluate current retrievers on handling attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest.

Attribute Image Retrieval +3

LoRe: Personalizing LLMs via Low-Rank Reward Modeling

no code implementations20 Apr 2025 Avinandan Bose, Zhihan Xiong, Yuejie Chi, Simon Shaolei Du, Lin Xiao, Maryam Fazel

Personalizing large language models (LLMs) to accommodate diverse user preferences is essential for enhancing alignment and user satisfaction.

Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

no code implementations CVPR 2025 Yiping Wang, Xuehai He, Kuan Wang, Luyao Ma, Jianwei Yang, Shuohang Wang, Simon Shaolei Du, Yelong Shen

However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, which is foreseeably an essential capability for future long video generation scenarios.

Story Completion Video Generation

On Erroneous Agreements of CLIP Image Embeddings

no code implementations7 Nov 2024 Siting Li, Pang Wei Koh, Simon Shaolei Du

Recent research suggests that the failures of Vision-Language Models (VLMs) at visual reasoning often stem from erroneous agreements -- when semantically distinct images are ambiguously encoded by the CLIP image encoder into embeddings with high cosine similarity.

Visual Reasoning
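The notion of an "erroneous agreement" follows directly from the definition in the abstract: two semantically distinct items whose embeddings have high cosine similarity. A minimal sketch with toy 2-D embeddings and a hypothetical similarity threshold (not the paper's code):

```python
import numpy as np

def cosine_similarity(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def erroneous_agreements(embeddings, labels, threshold=0.95):
    """Flag pairs of semantically distinct items (different labels) whose
    embeddings are nearly parallel (cosine similarity above threshold)."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if labels[i] != labels[j] and \
               cosine_similarity(embeddings[i], embeddings[j]) > threshold:
                pairs.append((i, j))
    return pairs

embs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
labels = ["dog", "cat", "car"]
print(erroneous_agreements(embs, labels))  # [(0, 1)]
```

Here the "dog" and "cat" embeddings agree almost perfectly despite the distinct labels, which is exactly the failure mode the paper studies in CLIP image encoders.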

Transformers are Efficient Compilers, Provably

no code implementations7 Oct 2024 Xiyu Zhai, Runlong Zhou, Liao Zhang, Simon Shaolei Du

Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks, including programming language understanding and generation.

Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning

no code implementations2 Jul 2024 Yifang Chen, Shuohang Wang, ZiYi Yang, Hiteshi Sharma, Nikos Karampatziakis, Donghan Yu, Kevin Jamieson, Simon Shaolei Du, Yelong Shen

Reinforcement learning with human feedback (RLHF), as a widely adopted approach in current large language model pipelines, is bottlenecked by the size of human preference data.

Active Learning Language Modelling +2

CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

2 code implementations29 May 2024 Yiping Wang, Yifang Chen, Wendan Yan, Alex Fang, Wenjing Zhou, Kevin Jamieson, Simon Shaolei Du

The three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric).

Contrastive Learning Language Modelling
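Approach (3) can be illustrated with a CLIPScore-style filter: score each image-caption pair by the cosine similarity of its embeddings and keep the top fraction. This is a sketch on toy vectors, not the paper's CLIPLoss or norm-based method; in practice the embeddings come from CLIP's image and text encoders.

```python
import numpy as np

def clip_score(img_emb, txt_emb):
    """CLIPScore-style metric: cosine similarity between unit-normalized
    image and caption embeddings (toy vectors here, not real CLIP outputs)."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_emb / np.linalg.norm(txt_emb)
    return float(img @ txt)

def select_top_fraction(pairs, fraction=0.5):
    """Keep the indices of the top-scoring fraction of (image, text) pairs."""
    scores = [clip_score(i, t) for i, t in pairs]
    k = max(1, int(len(pairs) * fraction))
    keep = np.argsort(scores)[::-1][:k]
    return sorted(int(i) for i in keep)

pairs = [
    (np.array([1.0, 0.0]), np.array([1.0, 0.1])),   # well-aligned pair
    (np.array([1.0, 0.0]), np.array([0.0, 1.0])),   # mismatched pair
]
print(select_top_fraction(pairs, 0.5))  # [0]
```

The mismatched pair scores near zero and is filtered out, which is the basic behavior any CLIP-embedding-based selection metric builds on.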

Offline Multi-task Transfer RL with Representational Penalization

no code implementations19 Feb 2024 Avinandan Bose, Simon Shaolei Du, Maryam Fazel

We study the problem of representation transfer in offline Reinforcement Learning (RL), where a learner has access to episodic data from a number of source tasks collected a priori, and aims to learn a shared representation to be used in finding a good policy for a target task.

Offline RL Reinforcement Learning (RL)

Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning

2 code implementations3 Feb 2024 Yiping Wang, Yifang Chen, Wendan Yan, Kevin Jamieson, Simon Shaolei Du

In recent years, data selection has emerged as a core issue for large-scale visual-language model pretraining, especially on noisy web-curated datasets.

Contrastive Learning Experimental Design +1

Free from Bellman Completeness: Trajectory Stitching via Model-based Return-conditioned Supervised Learning

1 code implementation30 Oct 2023 Zhaoyi Zhou, Chuning Zhu, Runlong Zhou, Qiwen Cui, Abhishek Gupta, Simon Shaolei Du

Off-policy dynamic programming (DP) techniques such as $Q$-learning have proven to be important in sequential decision-making problems.

Decision Making Offline RL +2
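The off-policy DP technique the abstract refers to is the standard temporal-difference update Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)). A minimal tabular sketch of that textbook update (not the paper's model-based return-conditioned method):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One off-policy Q-learning step: move Q(s,a) toward the TD target
    r + gamma * max_a' Q(s_next, a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((2, 2))                                  # 2 states x 2 actions
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)   # Q[0, 1] moves to 0.1
```

The max over next actions is what makes the update off-policy, and it is also the source of the Bellman-completeness requirement the paper aims to avoid.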

Robust Offline Reinforcement Learning -- Certify the Confidence Interval

no code implementations28 Sep 2023 Jiarui Yao, Simon Shaolei Du

Reinforcement learning (RL), and deep RL in particular, has received increasing research attention in recent years.

reinforcement-learning Reinforcement Learning +1

A Benchmark for Low-Switching-Cost Reinforcement Learning

no code implementations13 Dec 2021 Shusheng Xu, Yancheng Liang, Yunfei Li, Simon Shaolei Du, Yi Wu

A ubiquitous requirement in many practical reinforcement learning (RL) applications, including medical treatment, recommendation system, education and robotics, is that the deployed policy that actually interacts with the environment cannot change frequently.

Atari Games reinforcement-learning +2

Deep Q-Learning with Low Switching Cost

no code implementations1 Jan 2021 Shusheng Xu, Simon Shaolei Du, Yi Wu

We initiate the study of deep reinforcement learning problems that require low switching cost, i.e., a small number of policy switches during training.

Atari Games Deep Reinforcement Learning +3
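The switching cost being minimized is simply the number of times the deployed policy changes over a training run. An illustrative one-function sketch:

```python
def switching_cost(policy_ids):
    """Count how often the deployed policy changes across a training run.
    Low-switching-cost RL seeks to keep this count small."""
    return sum(1 for prev, cur in zip(policy_ids, policy_ids[1:]) if prev != cur)

# A run that deploys policy 0, switches once to policy 1, then stays on 1.
print(switching_cost([0, 0, 1, 1, 1]))  # 1
```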

Hypothesis Transfer Learning via Transformation Functions

no code implementations NeurIPS 2017 Simon Shaolei Du, Jayanth Koushik, Aarti Singh, Barnabas Poczos

We consider the Hypothesis Transfer Learning (HTL) problem where one incorporates a hypothesis trained on the source domain into the learning procedure of the target domain.

Transfer Learning
