1 code implementation • 11 Sep 2024 • Wei Shen, Chuheng Zhang
Reinforcement learning from human feedback (RLHF) is one of the key techniques that help large language models (LLMs) follow instructions and provide helpful and harmless responses.
1 code implementation • 20 Jul 2024 • Yunseon Choi, Sangmin Bae, Seonghyun Ban, Minchan Jeong, Chuheng Zhang, Lei Song, Li Zhao, Jiang Bian, Kee-Eung Kim
With the advent of foundation models, prompt tuning has positioned itself as an important technique for directing model behaviors and eliciting desired responses.
no code implementations • 17 Apr 2024 • Guangran Cheng, Chuheng Zhang, Wenzhe Cai, Li Zhao, Changyin Sun, Jiang Bian
While large language models (LLMs) succeed at a wide range of language processing tasks, they often fail to interact with the physical world because they struggle to generate proper control sequences.
no code implementations • 23 Mar 2024 • YiWen Chen, Yuyao Ye, Ziyi Chen, Chuheng Zhang, Marcelo H. Ang
Robot learning relies heavily on human expertise and effort, such as demonstrations, reward function design in reinforcement learning, and performance evaluation using human feedback.
no code implementations • 6 Aug 2023 • Lei Song, Chuheng Zhang, Li Zhao, Jiang Bian
2) How well can GPT-4 generalize to different scenarios for HVAC control?
1 code implementation • 13 Jun 2023 • Xianliang Yang, Zhihao Liu, Wei Jiang, Chuheng Zhang, Li Zhao, Lei Song, Jiang Bian
Multi-agent reinforcement learning (MARL) models multiple agents that interact and learn within a shared environment.
no code implementations • 12 May 2023 • Chuheng Zhang, Yitong Duan, Xiaoyu Chen, Jianyu Chen, Jian Li, Li Zhao
To evaluate our algorithms, we also implement a carefully designed simulator based on historical limit order book (LOB) data to provide a high-fidelity benchmark for different algorithms.
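The abstract does not describe the simulator mechanics here; as a minimal sketch of the general idea (all class and field names below are hypothetical, not from the paper), one can replay historical LOB snapshots and check when a limit order would have been filled:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LOBSnapshot:
    # Hypothetical minimal snapshot replayed from historical data.
    timestamp: float
    best_bid: float
    best_ask: float


class ReplaySimulator:
    """Toy replay-based simulator: a limit buy order is considered filled
    once the historical best ask reaches the order's limit price."""

    def __init__(self, snapshots: List[LOBSnapshot]):
        self.snapshots = snapshots

    def fill_time(self, limit_price: float, start: int = 0) -> Optional[float]:
        for snap in self.snapshots[start:]:
            if snap.best_ask <= limit_price:
                return snap.timestamp
        return None  # never filled within the replayed horizon


# Synthetic snapshots standing in for historical LOB records.
data = [LOBSnapshot(t, 100.00 - 0.01 * t, 100.05 - 0.01 * t) for t in range(100)]
sim = ReplaySimulator(data)
print(sim.fill_time(limit_price=99.50))  # time at which the toy order fills
```

A high-fidelity simulator would additionally model queue position, partial fills, and the market impact of the agent's own orders, which this toy replay ignores.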
no code implementations • 3 Mar 2023 • Yuanying Cai, Chuheng Zhang, Wei Shen, Xuyun Zhang, Wenjie Ruan, Longbo Huang
Inspired by the recent success of sequence modeling in RL and the use of masked language models for pre-training, we propose RePreM (Representation Pre-training with Masked Model), a masked model for pre-training in RL that trains an encoder combined with transformer blocks to predict the masked states or actions in a trajectory.
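As a rough illustration of masked-model pre-training on trajectories (a sketch of the general idea, not the authors' RePreM implementation), one can embed a sequence of states, replace chosen positions with a learned mask token, and train a transformer encoder to reconstruct them:

```python
import torch
import torch.nn as nn


class MaskedTrajectoryModel(nn.Module):
    """Toy masked-prediction pre-training: embed a trajectory of state vectors,
    replace masked positions with a learned mask token, and reconstruct them."""

    def __init__(self, state_dim: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, state_dim)

    def forward(self, states, mask):
        # states: (B, T, state_dim); mask: (B, T) bool, True = position is masked.
        x = self.embed(states)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.head(self.encoder(x))


model = MaskedTrajectoryModel(state_dim=8)
states = torch.randn(4, 10, 8)
mask = torch.zeros(4, 10, dtype=torch.bool)
mask[:, ::3] = True                         # mask every third step in the trajectory
pred = model(states, mask)
loss = ((pred - states) ** 2)[mask].mean()  # reconstruct only the masked states
loss.backward()
```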
no code implementations • 15 Dec 2022 • Yuandong Ding, Mingxiao Feng, Guozi Liu, Wei Jiang, Chuheng Zhang, Li Zhao, Lei Song, Houqiang Li, Yan Jin, Jiang Bian
In this paper, we consider the inventory management (IM) problem where we need to make replenishment decisions for a large number of stock keeping units (SKUs) to balance their supply and demand.
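For context on what a single replenishment decision looks like, the following is a classic per-SKU base-stock rule; it is a textbook baseline sketched for illustration, not the method proposed in the paper:

```python
import numpy as np


def base_stock_replenishment(on_hand, in_transit, target_levels):
    """Classic per-SKU base-stock rule: order up to a target level, counting
    stock already on hand plus stock that is still in transit."""
    inventory_position = on_hand + in_transit
    return np.maximum(target_levels - inventory_position, 0)


# Three SKUs with different targets; negative order quantities are clipped to zero.
print(base_stock_replenishment(
    on_hand=np.array([10, 0, 5]),
    in_transit=np.array([2, 1, 0]),
    target_levels=np.array([15, 8, 4]),
))  # -> [3 7 0]
```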
no code implementations • 5 Dec 2022 • Wei Shen, Xiaonan He, Chuheng Zhang, Xuyun Zhang, Jian Xie
Moreover, they are trained and evaluated on benchmark datasets with adequate labels, which are expensive to obtain in a commercial dialogue system.
1 code implementation • 5 Dec 2022 • Yuanying Cai, Chuheng Zhang, Li Zhao, Wei Shen, Xuyun Zhang, Lei Song, Jiang Bian, Tao Qin, Tie-Yan Liu
There are two challenges in this setting: 1) The optimal trade-off between optimizing the RL signal and the behavior cloning (BC) signal varies across states, because different behavior policies induce different action coverage.
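As a rough illustration of this trade-off (not the paper's algorithm), an offline objective can mix an RL term with a BC term using a per-state weight that is larger where the dataset's action coverage is narrow:

```python
import torch


def mixed_actor_loss(q_values, logp_data_actions, bc_weight):
    """Toy per-state mixture of an RL signal (maximize the critic's Q) and a BC
    signal (imitate the logged action).

    q_values:          (B,) critic value of the policy's own action
    logp_data_actions: (B,) policy log-probability of the dataset action
    bc_weight:         (B,) per-state coefficient in [0, 1]; larger where the
                       dataset's action coverage is narrow
    """
    rl_loss = -q_values
    bc_loss = -logp_data_actions
    return ((1.0 - bc_weight) * rl_loss + bc_weight * bc_loss).mean()


# States with poor coverage lean on BC; well-covered states lean on the RL signal.
loss = mixed_actor_loss(
    q_values=torch.tensor([1.2, 0.3]),
    logp_data_actions=torch.tensor([-0.5, -2.0]),
    bc_weight=torch.tensor([0.9, 0.1]),
)
print(loss.item())
```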
no code implementations • 2 Apr 2022 • Ze Wang, Guogang Liao, Xiaowen Shi, Xiaoxu Wu, Chuheng Zhang, Yongkang Wang, Xingxing Wang, Dong Wang
With the recent prevalence of reinforcement learning (RL), there has been tremendous interest in utilizing RL for ads allocation in recommendation platforms (e.g., e-commerce and news feed sites).
no code implementations • 2 Apr 2022 • Ze Wang, Guogang Liao, Xiaowen Shi, Xiaoxu Wu, Chuheng Zhang, Bingqi Zhu, Yongkang Wang, Xingxing Wang, Dong Wang
Ads allocation, which involves allocating ads and organic items to the limited slots in a feed to maximize platform revenue, has become a research hotspot.
no code implementations • 1 Apr 2022 • Guogang Liao, Xiaowen Shi, Ze Wang, Xiaoxu Wu, Chuheng Zhang, Yongkang Wang, Xingxing Wang, Dong Wang
A mixed list of ads and organic items is usually displayed in a feed, and how to allocate the limited slots to maximize the overall revenue is a key problem.
1 code implementation • 9 Sep 2021 • Guogang Liao, Ze Wang, Xiaoxu Wu, Xiaowen Shi, Chuheng Zhang, Yongkang Wang, Xingxing Wang, Dong Wang
Our model results in higher revenue and better user experience than state-of-the-art baselines in offline experiments.
2 code implementations • 25 Aug 2021 • Wei Shen, Chuheng Zhang, Yun Tian, Liang Zeng, Xiaonan He, Wanchun Dou, Xiaolong Xu
However, without node content (i.e., side information) for training, user- (or item-)specific representations cannot be learned in the inductive setting; that is, a model trained on one group of users (or items) cannot adapt to new users (or items), as sketched below.
Ranked #3 on Recommendation Systems on MovieLens 1M
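A minimal sketch of the distinction (not the paper's model): an ID-embedding recommender has no learned row for an unseen user, whereas an inductive scorer can represent a new user from their interaction history alone:

```python
import torch
import torch.nn as nn

n_users, n_items, d = 100, 50, 16
user_emb = nn.Embedding(n_users, d)   # transductive: one learned row per training user
item_emb = nn.Embedding(n_items, d)


def score_transductive(user_id: int, item_id: int) -> torch.Tensor:
    # Breaks for any user_id >= n_users: there is no learned row for a new user.
    return (user_emb(torch.tensor(user_id)) * item_emb(torch.tensor(item_id))).sum()


def score_inductive(history_item_ids, item_id: int) -> torch.Tensor:
    # Represent a (possibly new) user by aggregating the items they interacted
    # with, so no user-specific parameters are needed at inference time.
    user_vec = item_emb(torch.tensor(history_item_ids)).mean(dim=0)
    return (user_vec * item_emb(torch.tensor(item_id))).sum()


print(score_inductive(history_item_ids=[3, 7, 19], item_id=5).item())
```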
no code implementations • ICLR 2021 • Guoqing Liu, Chuheng Zhang, Li Zhao, Tao Qin, Jinhua Zhu, Jian Li, Nenghai Yu, Tie-Yan Liu
Recently, various auxiliary tasks have been proposed to accelerate representation learning and improve sample efficiency in deep reinforcement learning (RL).
1 code implementation • 3 Oct 2020 • Chuheng Zhang, Yuanqi Li, Xi Chen, Yifei Jin, Pingzhong Tang, Jian Li
Modern machine learning models (such as deep neural networks and boosting decision tree models) have become increasingly popular in financial market prediction, due to their superior capacity to extract complex non-linear patterns.
no code implementations • 25 Aug 2020 • Wei Shen, Xiaonan He, Chuheng Zhang, Qiang Ni, Wanchun Dou, Yan Wang
Therefore, it is crucial to design a participant selection algorithm that applies to different MCS systems to achieve multiple goals.
no code implementations • 11 Jun 2020 • Chuheng Zhang, Yuanying Cai, Longbo Huang, Jian Li
In the planning phase, the agent computes a good policy for any reward function based on the dataset without further interacting with the environment.
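A minimal sketch of such a planning phase, assuming a dataset of (state, action, next state) transitions has already been collected during exploration (a toy tabular illustration, not the paper's algorithm): estimate the transition model from the data and run value iteration for whatever reward function is supplied later:

```python
import numpy as np


def plan_from_dataset(transitions, reward_fn, n_states, n_actions,
                      gamma=0.95, iters=200):
    """Toy planning phase: estimate P(s' | s, a) from logged (s, a, s') tuples,
    then run value iteration for an arbitrary reward function, with no further
    environment interaction."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=-1, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform next-state distribution.
    P = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

    R = np.array([[reward_fn(s, a) for a in range(n_actions)]
                  for s in range(n_states)])
    V = np.zeros(n_states)
    for _ in range(iters):
        V = (R + gamma * P @ V).max(axis=1)
    return (R + gamma * P @ V).argmax(axis=1)  # greedy action per state


# A tiny dataset collected beforehand; the reward function arrives only at planning time.
data = [(0, 1, 1), (1, 1, 2), (2, 0, 2), (0, 0, 0), (1, 0, 0)]
print(plan_from_dataset(data, reward_fn=lambda s, a: float(s == 2),
                        n_states=3, n_actions=2))
```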
no code implementations • 27 May 2019 • Chuheng Zhang, Yuanqi Li, Jian Li
We observe that several existing policy gradient methods (such as vanilla policy gradient, PPO, A2C) may suffer from overly large gradients when the current policy is close to deterministic (even in some very simple environments), leading to an unstable training process.
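One way to see how this can happen (a generic illustration, not an example taken from the paper): for a Gaussian policy, the score-function gradient with respect to the mean scales as 1/sigma^2, so it blows up as the policy collapses toward a deterministic one:

```python
def gaussian_logprob_grad_mu(action, mu, sigma):
    """d/d(mu) of log N(action; mu, sigma^2) = (action - mu) / sigma^2."""
    return (action - mu) / sigma ** 2


# For the same sampled deviation, the gradient explodes as the policy's
# standard deviation shrinks, i.e. as the policy becomes near-deterministic.
for sigma in [1.0, 0.1, 0.01]:
    print(sigma, gaussian_logprob_grad_mu(action=0.5, mu=0.0, sigma=sigma))
# 1.0  -> 0.5
# 0.1  -> 50.0
# 0.01 -> 5000.0
```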