Search Results for author: Dong Yan

Found 22 papers, 6 papers with code

Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search

no code implementations18 Nov 2024 Jinhao Jiang, Zhipeng Chen, Yingqian Min, Jie Chen, Xiaoxue Cheng, Jiapeng Wang, Yiru Tang, Haoxiang Sun, Jia Deng, Wayne Xin Zhao, Zheng Liu, Dong Yan, Jian Xie, Zhongyuan Wang, Ji-Rong Wen

It is primarily constructed around a tree search algorithm, where the policy model navigates a dynamically expanding tree guided by a specially trained reward model.

Mathematical Reasoning
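
A minimal sketch of the reward-guided tree search idea described in the abstract above: a policy model proposes candidate next steps, a reward model scores partial solutions, and the highest-scoring nodes are expanded first. The `policy_propose` and `reward_score` callables are hypothetical placeholders, not the paper's actual models.

```python
import heapq

def reward_guided_tree_search(question, policy_propose, reward_score,
                              beam_width=4, max_depth=8):
    """Best-first expansion of reasoning paths guided by a reward model.

    policy_propose(question, path) -> list of candidate next steps (strings)
    reward_score(question, path)   -> scalar score for a partial solution
    Both callables are hypothetical stand-ins for the trained policy and
    reward models described in the paper.
    """
    # Max-heap keyed on negative reward so the best partial path pops first.
    frontier = [(-reward_score(question, []), [])]
    best_path, best_score = [], float("-inf")
    while frontier:
        neg_score, path = heapq.heappop(frontier)
        score = -neg_score
        if score > best_score:
            best_score, best_path = score, path
        if len(path) >= max_depth:
            continue
        # The policy model dynamically expands the tree; the reward model ranks children.
        for step in policy_propose(question, path)[:beam_width]:
            child = path + [step]
            heapq.heappush(frontier, (-reward_score(question, child), child))
    return best_path, best_score
```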

Boosting Deductive Reasoning with Step Signals In RLHF

no code implementations12 Oct 2024 Jialian Li, Yipin Zhang, Wei Shen, Yuzi Yan, Jian Xie, Dong Yan

Logical reasoning is a crucial task for Large Language Models (LLMs), enabling them to tackle complex problems.

Formal Logic Logical Reasoning

Beyond Scalar Reward Model: Learning Generative Judge from Preference Data

no code implementations1 Oct 2024 Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, Yiqun Liu

Conventionally, preference data is learned and encoded into a scalar reward model, which attaches a value head to an LLM to produce a scalar score as the preference or reward.
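
A minimal PyTorch sketch of the conventional scalar reward model the abstract contrasts with: a one-dimensional value head on top of a causal LM's final hidden state yields a single score per (prompt, response) pair. The backbone is assumed to follow a Hugging Face-style interface, and all names and dimensions are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalarRewardModel(nn.Module):
    """Scalar reward model: an LLM backbone plus a one-dimensional value head."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # assumed: causal LM returning hidden states
        self.value_head = nn.Linear(hidden_size, 1)   # maps hidden state -> scalar reward

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids,
                            attention_mask=attention_mask,
                            output_hidden_states=True)
        last_hidden = out.hidden_states[-1]           # (batch, seq, hidden)
        # Score each sequence from the hidden state of its final non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1      # (batch,)
        pooled = last_hidden[torch.arange(last_hidden.size(0)), last_idx]
        return self.value_head(pooled).squeeze(-1)    # (batch,) scalar rewards

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry style loss on (chosen, rejected) reward pairs."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```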

Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown

no code implementations1 Oct 2024 Xingzhou Lou, Dong Yan, Wei Shen, Yuzi Yan, Jian Xie, Junge Zhang

Reward models (RMs) play a critical role in aligning the generations of large language models (LLMs) with human expectations.

Uncertainty Quantification

Reward-Robust RLHF in LLMs

no code implementations18 Sep 2024 Yuzi Yan, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, Yuan Shen

As Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is increasingly seen as a key pathway toward achieving Artificial General Intelligence (AGI).

3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

no code implementations11 Jun 2024 Yuzi Yan, Yibo Miao, Jialian Li, Yipin Zhang, Jian Xie, Zhijie Deng, Dong Yan

Aligning large language models (LLMs) with human preference has recently gained tremendous attention, with the canonical yet costly RLHF-PPO and the simple and straightforward Direct Preference Optimization (DPO) as two examples.

Instruction Following Mathematical Problem-Solving
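
For context on the DPO objective the abstract refers to, here is a minimal sketch of the standard DPO loss computed from per-sequence log-probabilities under the policy and the frozen reference model. It illustrates the published objective only; it is not the paper's analysis code, and `beta=0.1` is just a common default.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * margin of log-ratios)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```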

Exploring the LLM Journey from Cognition to Expression with Linear Representations

no code implementations27 May 2024 Yuzi Yan, Jialian Li, Yipin Zhang, Dong Yan

This paper presents an in-depth examination of the evolution and interplay of cognitive and expressive capabilities in large language models (LLMs), with a specific focus on Baichuan-7B and Baichuan-33B, an advanced bilingual (Chinese and English) LLM series.

Few-Shot Learning

SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling

1 code implementation21 May 2024 Xingzhou Lou, Junge Zhang, Jian Xie, Lifeng Liu, Dong Yan, Kaiqi Huang

Human preference alignment is critical in building powerful and reliable large language models (LLMs).

Reward Generalization in RLHF: A Topological Perspective

no code implementations15 Feb 2024 Tianyi Qiu, Fanzhi Zeng, Jiaming Ji, Dong Yan, Kaile Wang, Jiayi Zhou, Yang Han, Josef Dai, Xuehai Pan, Yaodong Yang

As a solution, we introduce a theoretical framework for investigating reward generalization in reinforcement learning from human feedback (RLHF), focusing on the topology of information flow at both macro and micro levels.

Generalization Bounds Language Modelling +1

Task Aware Dreamer for Task Generalization in Reinforcement Learning

no code implementations9 Mar 2023 Chengyang Ying, Zhongkai Hao, Xinning Zhou, Hang Su, Songming Liu, Dong Yan, Jun Zhu

Extensive experiments in both image-based and state-based tasks show that TAD can significantly improve the performance of handling different tasks simultaneously, especially for those with high TDR, and display a strong generalization ability to unseen tasks.

Reinforcement Learning +1

Model-based Reinforcement Learning with a Hamiltonian Canonical ODE Network

no code implementations2 Nov 2022 Yao Feng, Yuhong Jiang, Hang Su, Dong Yan, Jun Zhu

Model-based reinforcement learning usually suffers from high sample complexity in training the world model, especially for environments with complex dynamics.

Model-based Reinforcement Learning Reinforcement Learning +2

On the Reuse Bias in Off-Policy Reinforcement Learning

1 code implementation15 Sep 2022 Chengyang Ying, Zhongkai Hao, Xinning Zhou, Hang Su, Dong Yan, Jun Zhu

In this paper, we reveal that the instability is also related to a new notion of Reuse Bias of IS -- the bias in off-policy evaluation caused by the reuse of the replay buffer for evaluation and optimization.

Continuous Control +3
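
A minimal sketch of the setting the abstract describes: ordinary importance-sampling (IS) off-policy evaluation on trajectories drawn from a replay buffer. Reusing the same buffer both to optimize the policy and to evaluate it is what induces the reuse bias the paper studies; the estimator below only illustrates the quantity involved, not the paper's correction. The `target_logp` and `behavior_logp` callables are hypothetical stand-ins.

```python
import numpy as np

def is_return_estimate(trajectories, target_logp, behavior_logp, gamma=0.99):
    """Ordinary IS estimate of the target policy's return from behavior-policy data.

    trajectories: list of episodes, each a list of (state, action, reward) tuples
    target_logp / behavior_logp: callables giving log pi(a|s) -- assumed stand-ins
    for the learned policy and the logging (behavior) policy.
    """
    estimates = []
    for traj in trajectories:
        log_ratio, ret = 0.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            log_ratio += target_logp(s, a) - behavior_logp(s, a)
            ret += (gamma ** t) * r
        estimates.append(np.exp(log_ratio) * ret)
    # Reusing the same replay buffer for both optimization and this evaluation
    # is the source of the "reuse bias" highlighted in the paper.
    return float(np.mean(estimates))
```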

Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk

1 code implementation9 Jun 2022 Chengyang Ying, Xinning Zhou, Hang Su, Dong Yan, Ning Chen, Jun Zhu

Though deep reinforcement learning (DRL) has obtained substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty of both transition and observation.

Continuous Control +4

Policy Learning for Robust Markov Decision Process with a Mismatched Generative Model

no code implementations13 Mar 2022 Jialian Li, Tongzheng Ren, Dong Yan, Hang Su, Jun Zhu

Our goal is to identify a near-optimal robust policy for the perturbed testing environment, which introduces additional technical difficulties as we need to simultaneously estimate the training environment uncertainty from samples and find the worst-case perturbation for testing.

Tianshou: a Highly Modularized Deep Reinforcement Learning Library

1 code implementation29 Jul 2021 Jiayi Weng, Huayu Chen, Dong Yan, Kaichao You, Alexis Duburcq, Minghao Zhang, Yi Su, Hang Su, Jun Zhu

In this paper, we present Tianshou, a highly modularized Python library for deep reinforcement learning (DRL) that uses PyTorch as its backend.

Deep Reinforcement Learning Reinforcement Learning +1
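
A minimal usage sketch of Tianshou, assuming the classic 0.4-style API (DQN on CartPole). Class names and exact constructor arguments have changed across library versions, so treat this as an illustrative quickstart rather than canonical documentation.

```python
import gym
import torch
from tianshou.data import Collector, VectorReplayBuffer
from tianshou.env import DummyVectorEnv
from tianshou.policy import DQNPolicy
from tianshou.utils.net.common import Net

# Vectorized environments and a small MLP Q-network.
envs = DummyVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(4)])
net = Net(state_shape=4, action_shape=2, hidden_sizes=[64, 64])
optim = torch.optim.Adam(net.parameters(), lr=1e-3)

policy = DQNPolicy(net, optim, discount_factor=0.99,
                   estimation_step=3, target_update_freq=320)
policy.set_eps(0.1)                                   # epsilon-greedy exploration
buffer = VectorReplayBuffer(20000, buffer_num=4)
collector = Collector(policy, envs, buffer, exploration_noise=True)

for epoch in range(10):
    collector.collect(n_step=1000)                    # gather transitions
    for _ in range(100):
        policy.update(64, buffer)                     # sample a batch, one gradient step
```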

Towards Safe Reinforcement Learning via Constraining Conditional Value at Risk

1 code implementation ICML Workshop AML 2021 Chengyang Ying, Xinning Zhou, Dong Yan, Jun Zhu

Though deep reinforcement learning (DRL) has obtained substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty caused by stochastic policies and environment variability.

Continuous Control +4

Adaptive N-step Bootstrapping with Off-policy Data

no code implementations1 Jan 2021 Guan Wang, Dong Yan, Hang Su, Jun Zhu

In this work, we point out that the optimal value of n actually differs across data points, whereas a fixed n is only a rough average of them.

Atari Games
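
To make the point about n concrete, here is a sketch of the n-step bootstrapped target in which n can vary per data point. How n is chosen adaptively is the paper's contribution and is not reproduced here; the function below is generic textbook bookkeeping.

```python
def n_step_target(rewards, values, t, n, gamma=0.99):
    """n-step bootstrapped return target for the state at time t.

    rewards: per-step rewards r_0, ..., r_{T-1} of one episode
    values:  value estimates V(s_0), ..., V(s_T); use 0 for a terminal state
    n:       lookahead length, which may differ for every data point
    """
    horizon = min(n, len(rewards) - t)                       # truncate at episode end
    target = sum((gamma ** k) * rewards[t + k] for k in range(horizon))
    target += (gamma ** horizon) * values[t + horizon]       # bootstrap from V
    return target
```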

Satellite-Terrestrial Channel Characterization in High-Speed Railway Environment at 22.6 GHz

no code implementations11 Jun 2020 Lei Ma, Ke Guan, Dong Yan, Danping He, Nuno R. Leonor, Bo Ai, Junhyeong Kim

In this paper, the satellite-terrestrial channel at 22.6 GHz is characterized for a typical high-speed railway (HSR) environment.

Lazy-CFR: fast and near-optimal regret minimization for extensive games with imperfect information

no code implementations ICLR 2020 Yichi Zhou, Tongzheng Ren, Jialian Li, Dong Yan, Jun Zhu

In this paper, we present Lazy-CFR, a CFR algorithm that adopts a lazy update strategy to avoid traversing the whole game tree in each round.

counterfactual
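
The quantity CFR maintains at every information set, and which Lazy-CFR updates lazily rather than on every traversal, is the regret-matching strategy. Below is a minimal sketch of that per-infoset update (generic CFR bookkeeping with reach-probability weighting omitted for brevity), not the paper's lazy scheduling.

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Current strategy at one information set from its cumulative regrets."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full_like(positive, 1.0 / len(positive))   # uniform if no positive regret

def update_infoset(cumulative_regret, action_values):
    """One CFR regret update at a single infoset: each action's counterfactual
    value versus the value of the current strategy. Lazy-CFR avoids running
    this at every infoset on every round."""
    strategy = regret_matching(cumulative_regret)
    node_value = float(strategy @ action_values)
    cumulative_regret += action_values - node_value
    return strategy, node_value
```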

Reward Shaping via Meta-Learning

no code implementations27 Jan 2019 Haosheng Zou, Tongzheng Ren, Dong Yan, Hang Su, Jun Zhu

Reward shaping is one of the most effective methods to tackle the crucial yet challenging problem of credit assignment in Reinforcement Learning (RL).

Meta-Learning Reinforcement Learning +1
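
A minimal sketch of the potential-based shaping form that reward-shaping methods build on: the shaped reward adds γΦ(s') − Φ(s) to the environment reward, which leaves the optimal policy unchanged. The potential function Φ here is a hypothetical placeholder; the paper's contribution, learning the shaping via meta-learning, is not shown.

```python
def shaped_reward(reward, state, next_state, potential, gamma=0.99, done=False):
    """Potential-based reward shaping: r' = r + gamma * Phi(s') - Phi(s).

    potential: callable Phi(state) -> float, a hypothetical stand-in for a
    learned potential function. Shaping of this form preserves the optimal
    policy of the original MDP.
    """
    next_potential = 0.0 if done else potential(next_state)
    return reward + gamma * next_potential - potential(state)
```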

Lazy-CFR: fast and near optimal regret minimization for extensive games with imperfect information

no code implementations10 Oct 2018 Yichi Zhou, Tongzheng Ren, Jialian Li, Dong Yan, Jun Zhu

In this paper, we present a novel technique, lazy update, which can avoid traversing the whole game tree in CFR, as well as a novel analysis of the regret of CFR with lazy update.

counterfactual
