no code implementations • 11 Mar 2025 • Tong Wei, Yijun Yang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs).
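The core of the "verifiable outcome reward" idea can be sketched as a programmatic answer check in place of a learned reward model. This is a minimal illustration, not the paper's implementation; the `####` final-answer convention is an assumption for the example.

```python
def verifiable_outcome_reward(response: str, reference_answer: str) -> float:
    """Binary outcome reward from an automatic check of the final answer.

    Sketch of the RLVR idea: the reward is computed by verifying the
    model's final answer against a known ground truth, rather than by
    querying a learned reward model. The '#### <answer>' delimiter is a
    hypothetical convention used only for this example.
    """
    final = response.split("####")[-1].strip()
    return 1.0 if final == reference_answer.strip() else 0.0

print(verifiable_outcome_reward("6 times 7... #### 42", "42"))  # → 1.0
```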
no code implementations • 4 Jan 2025 • Kanefumi Matsuyama, Kefan Su, Jiangxing Wang, Deheng Ye, Zongqing Lu
To this end, we propose a hierarchical MARL approach to enable generalizable cooperation via role diversity, namely CORD.
1 code implementation • 1 Dec 2024 • Mingyu Yang, Junyou Li, Zhongbin Fang, Sheng Chen, Yangbin Yu, Qiang Fu, Wei Yang, Deheng Ye
In recent years, Artificial Intelligence Generated Content (AIGC) has advanced from text-to-image generation to text-to-video and multimodal video synthesis.
1 code implementation • 18 Nov 2024 • Ziyi Zhang, Li Shen, Sen Zhang, Deheng Ye, Yong Luo, Miaojing Shi, Bo Du, DaCheng Tao
Experimental results demonstrate that SDPO consistently outperforms prior methods in reward-based alignment across diverse step configurations, underscoring its robust step generalization capabilities.
1 code implementation • 23 Oct 2024 • Yao Tang, Zhihui Xie, Zichuan Lin, Deheng Ye, Shuai Li
Motivated by how humans learn by organizing knowledge in a curriculum, CurrMask adjusts its masking scheme during pretraining for learning versatile skills.
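One family of masking schemes such a curriculum could adjust is block-wise masking over trajectory sequences: larger blocks force longer-horizon inference, smaller blocks emphasize local dynamics. The sketch below is a generic illustration under that assumption, not CurrMask's actual scheduler; `block_size` and `mask_ratio` stand in for the knobs a curriculum would vary during pretraining.

```python
import numpy as np

def block_mask(seq_len: int, block_size: int, mask_ratio: float, rng) -> np.ndarray:
    """Sample a block-wise boolean mask over a length-seq_len sequence.

    Hypothetical sketch: a curriculum would vary block_size / mask_ratio
    over the course of pretraining to trade off local-dynamics learning
    against long-horizon skill learning.
    """
    mask = np.zeros(seq_len, dtype=bool)
    n_blocks = max(1, int(mask_ratio * seq_len / block_size))
    starts = rng.integers(0, max(1, seq_len - block_size + 1), size=n_blocks)
    for s in starts:
        mask[s:s + block_size] = True  # mask a contiguous block of tokens
    return mask

rng = np.random.default_rng(0)
print(block_mask(20, block_size=4, mask_ratio=0.3, rng=rng))
```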
no code implementations • 9 Oct 2024 • Siyu Zhou, Tianyi Zhou, Yijun Yang, Guodong Long, Deheng Ye, Jing Jiang, Chengqi Zhang
The resulting world model is composed of the LLM and the learned rules.
1 code implementation • NeurIPS 2023 • Yun Qu, Boyuan Wang, Jianzhun Shao, Yuhang Jiang, Chen Chen, Zhenbin Ye, Lin Liu, Junfeng Yang, Lin Lai, Hongyang Qin, Minwen Deng, Juchao Zhuo, Deheng Ye, Qiang Fu, Wei Yang, Guang Yang, Lanxiao Huang, Xiangyang Ji
The advancement of Offline Reinforcement Learning (RL) and Offline Multi-Agent Reinforcement Learning (MARL) critically depends on the availability of high-quality, pre-collected offline datasets that represent real-world complexities and practical applications.
no code implementations • 2 Aug 2024 • Ruize Zhang, Zelai Xu, Chengdong Ma, Chao Yu, Wei-Wei Tu, Wenhao Tang, Shiyu Huang, Deheng Ye, Wenbo Ding, Yaodong Yang, Yu Wang
Self-play, characterized by agents' interactions with copies or past versions of themselves, has recently gained prominence in reinforcement learning (RL).
1 code implementation • 4 Jul 2024 • Fuxiang Zhang, Junyou Li, Yi-Chen Li, Zongzhang Zhang, Yang Yu, Deheng Ye
In this paper, we introduce a framework that harnesses LLMs to extract background knowledge of an environment, which contains general understandings of the entire environment, making various downstream RL tasks benefit from one-time knowledge representation.
no code implementations • 5 Mar 2024 • Liangzhou Wang, Kaiwen Zhu, Fengming Zhu, Xinghu Yao, Shujie Zhang, Deheng Ye, Haobo Fu, Qiang Fu, Wei Yang
The common goal is an achievable state with high value, which is obtained by sampling from the distribution of future states.
1 code implementation • 3 Feb 2024 • Yangbin Yu, Qin Zhang, Junyou Li, Qiang Fu, Deheng Ye
The emergence of large language models (LLMs) has significantly advanced the simulation of believable interactive agents.
2 code implementations • 3 Feb 2024 • Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, Deheng Ye
We find that, simply via a sampling-and-voting method, the performance of large language models (LLMs) scales with the number of agents instantiated.
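The sampling-and-voting method described here can be sketched in a few lines: query the model several times and take the majority answer. The `generate` callable below is a hypothetical stand-in for an LLM sampling call, not an actual API.

```python
import itertools
from collections import Counter

def sample_and_vote(generate, prompt: str, n_agents: int = 5) -> str:
    """Query the model n_agents times and return the majority answer.

    `generate` is a hypothetical callable standing in for one stochastic
    LLM sampling call; each invocation returns one candidate answer.
    """
    answers = [generate(prompt) for _ in range(n_agents)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

# Toy stand-in "model": a deterministic cycle of candidate answers.
fake = itertools.cycle(["42", "42", "17"])
print(sample_and_vote(lambda p: next(fake), "What is 6 * 7?", n_agents=5))  # → 42
```

Accuracy improves with `n_agents` whenever the correct answer is sampled more often than any single incorrect one, which is the scaling effect the entry refers to.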
no code implementations • 18 Jan 2024 • He Zhao, Zhiwei Zeng, Yongwei Wang, Deheng Ye, Chunyan Miao
Heterogeneous Graph Neural Networks (HGNNs) are increasingly recognized for their performance in areas like the web and e-commerce, where resilience against adversarial attacks is crucial.
1 code implementation • 20 Nov 2023 • Tiantian Zhang, Kevin Zehua Shen, Zichuan Lin, Bo Yuan, Xueqian Wang, Xiu Li, Deheng Ye
On the other hand, offline learning on replayed tasks while learning a new task may induce a distributional shift between the dataset and the learned policy on old tasks, resulting in forgetting.
1 code implementation • 23 Oct 2023 • Yihuai Lan, Zhiqiang Hu, Lei Wang, Yang Wang, Deheng Ye, Peilin Zhao, Ee-Peng Lim, Hui Xiong, Hao Wang
This paper explores the open research problem of understanding the social behaviors of LLM-based agents.
1 code implementation • 24 Aug 2023 • Hanchi Huang, Li Shen, Deheng Ye, Wei Liu
We propose a novel master-slave architecture to solve the top-$K$ combinatorial multi-armed bandits problem with non-linear bandit feedback and diversity constraints. To the best of our knowledge, this is the first combinatorial bandits setting to consider diversity constraints under bandit feedback.
1 code implementation • 10 Jul 2023 • Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Wei Yang, Deheng Ye
The goal of program synthesis, or code generation, is to generate executable code based on given descriptions.
1 code implementation • 26 May 2023 • Zhihui Xie, Zichuan Lin, Deheng Ye, Qiang Fu, Wei Yang, Shuai Li
While promising, return conditioning is limited to training data labeled with rewards and therefore faces challenges in learning from unsupervised data.
no code implementations • 13 Mar 2023 • Ziniu Li, Ke Xu, Liu Liu, Lanqing Li, Deheng Ye, Peilin Zhao
To address this issue, we propose an alternative framework that involves a human supervising the RL models and providing additional feedback in the online deployment phase.
1 code implementation • 5 Feb 2023 • Zichuan Lin, Xiapeng Wu, Mingfei Sun, Deheng Ye, Qiang Fu, Wei Yang, Wei Liu
Recent success in Deep Reinforcement Learning (DRL) methods has shown that policy optimization with respect to an off-policy distribution via importance sampling is effective for sample reuse.
no code implementations • 20 Jan 2023 • Haoxuan Pan, Deheng Ye, Xiaoming Duan, Qiang Fu, Wei Yang, Jianping He, Mingfei Sun
We show that, despite such state distribution shift, the policy gradient estimation bias can be reduced in the following three ways: 1) a small learning rate; 2) an adaptive-learning-rate-based optimizer; and 3) KL regularization.
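Of the three mitigations listed, the KL-regularization one is the easiest to make concrete: penalize the updated policy's divergence from the behavior policy that collected the data. The sketch below is a generic KL-regularized policy-gradient surrogate under that assumption, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_regularized_pg_loss(logits, old_logits, actions, advantages, beta=0.1):
    """Policy-gradient surrogate with a KL penalty toward the behavior policy.

    Generic sketch: the KL(old || new) term discourages the updated policy
    from drifting far from the data-collecting policy, which limits the
    state-distribution shift that biases the gradient estimate.
    """
    p_new = softmax(logits)
    p_old = softmax(old_logits)
    logp = np.log(p_new)
    act_logp = logp[np.arange(len(actions)), actions]
    pg = -(advantages * act_logp).mean()                    # vanilla PG term
    kl = (p_old * (np.log(p_old) - logp)).sum(-1).mean()    # KL(old || new)
    return pg + beta * kl
```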
no code implementations • 8 Jan 2023 • Wenzhe Li, Hao Luo, Zichuan Lin, Chongjie Zhang, Zongqing Lu, Deheng Ye
Transformer has been considered the dominant neural architecture in NLP and CV, mostly under supervised settings.
1 code implementation • 4 Dec 2022 • Boxuan Zhao, Jun Zhang, Deheng Ye, Jian Cao, Xiao Han, Qiang Fu, Wei Yang
Most of the existing methods rely on a multiple instance learning framework that requires densely sampling local patches at high magnification.
no code implementations • 8 Nov 2022 • Zhihui Xie, Zichuan Lin, Junyou Li, Shuai Li, Deheng Ye
The past few years have seen rapid progress in combining reinforcement learning (RL) with deep learning.
1 code implementation • 7 Nov 2022 • Hanchi Huang, Deheng Ye, Li Shen, Wei Liu
To mitigate the negative influence of customizing the one-off training order in curriculum-based AMTL, CAMRL switches its training mode between parallel single-task RL and asymmetric multi-task RL (MTRL), according to an indicator regarding the training time, the overall performance, and the performance gap among tasks.
1 code implementation • 19 Oct 2022 • Chengqian Gao, Ke Xu, Liu Liu, Deheng Ye, Peilin Zhao, Zhiqiang Xu
A promising paradigm for offline reinforcement learning (RL) is to constrain the learned policy to stay close to the dataset behaviors, known as policy constraint offline RL.
1 code implementation • 26 Sep 2022 • Jiangxing Wang, Deheng Ye, Zongqing Lu
To this end, we propose multi-agent conditional policy factorization (MACPF), which takes more centralized training but still enables decentralized execution.
1 code implementation • 21 Sep 2022 • Haibin Zhou, Tong Wei, Zichuan Lin, Junyou Li, Junliang Xing, Yuanchun Shi, Li Shen, Chao Yu, Deheng Ye
We study the adaptation of Soft Actor-Critic (SAC), a state-of-the-art reinforcement learning (RL) algorithm, from continuous action spaces to discrete action spaces.
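A key simplification available in the discrete setting is that SAC's soft state value becomes an exact expectation over the finite action set rather than a sampled estimate. The sketch below illustrates that computation only; it is not the paper's full algorithm.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def discrete_soft_value(q_values, policy_logits, alpha=0.2):
    """Soft state value V(s) for discrete-action SAC.

    With a finite action set, V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)]
    is computed exactly by summing over all actions, instead of the
    single-sample estimate needed in continuous action spaces.
    """
    pi = softmax(policy_logits)
    logpi = np.log(pi + 1e-8)  # small epsilon guards against log(0)
    return (pi * (q_values - alpha * logpi)).sum(axis=-1)
```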
1 code implementation • 18 Sep 2022 • Hua Wei, Jingxiao Chen, Xiyang Ji, Hongyang Qin, Minwen Deng, Siqin Li, Liang Wang, Weinan Zhang, Yong Yu, Lin Liu, Lanxiao Huang, Deheng Ye, Qiang Fu, Wei Yang
Compared to other environments studied in most previous work, ours presents new generalization challenges for competitive reinforcement learning.
no code implementations • 1 Sep 2022 • Tiantian Zhang, Zichuan Lin, Yuxing Wang, Deheng Ye, Qiang Fu, Wei Yang, Xueqian Wang, Bin Liang, Bo Yuan, Xiu Li
A key challenge of continual reinforcement learning (CRL) in dynamic environments is to promptly adapt the RL agent's behavior as the environment changes over its lifetime, while minimizing the catastrophic forgetting of the learned information.
no code implementations • 11 Aug 2022 • Ke Xu, Jianqiao Wangni, Yifan Zhang, Deheng Ye, Jiaxiang Wu, Peilin Zhao
Therefore, a threshold quantization strategy with a relatively small error is adopted in QCMD adagrad and QRDA adagrad to improve the signal-to-noise ratio and preserve the sparsity of the model.
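The thresholding idea can be illustrated generically: quantize values to a uniform grid and zero out entries whose magnitude falls below a threshold, so noise-dominated components are dropped while model sparsity is preserved. This is a hedged sketch of the general technique, not the QCMD/QRDA adagrad implementation; `step` and `tau` are illustrative parameters.

```python
import numpy as np

def threshold_quantize(x, tau, step=0.1):
    """Uniform quantization with a sparsifying threshold.

    Generic sketch: values are rounded to a grid of spacing `step`,
    and entries with magnitude below `tau` are set to zero, preserving
    sparsity while keeping the per-entry error bounded.
    """
    q = np.round(x / step) * step   # quantize to a uniform grid
    q[np.abs(x) < tau] = 0.0        # threshold small (noisy) entries to zero
    return q
```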
no code implementations • 12 May 2022 • Qianggang Ding, Deheng Ye, Tingyang Xu, Peilin Zhao
To the best of our knowledge, our method is the first GNN-based bilevel optimization framework for resolving this task.
no code implementations • 17 Feb 2022 • Anssi Kanervisto, Stephanie Milani, Karolis Ramanauskas, Nicholay Topin, Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang, Weijun Hong, Zhongyue Huang, Haicheng Chen, Guangjun Zeng, Yue Lin, Vincent Micheli, Eloi Alonso, François Fleuret, Alexander Nikulin, Yury Belousov, Oleg Svidchenko, Aleksei Shpilman
With this in mind, we hosted the third edition of the MineRL ObtainDiamond competition, MineRL Diamond 2021, with a separate track in which any solution was permitted, to encourage participation by newcomers.
no code implementations • 7 Dec 2021 • Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang
To address this, we propose JueWu-MC, a sample-efficient hierarchical RL approach equipped with representation learning and imitation learning to deal with perception and exploration.
1 code implementation • NeurIPS 2021 • Zifan Wu, Chao Yu, Deheng Ye, Junge Zhang, Haiyin Piao, Hankz Hankui Zhuo
We present Coordinated Proximal Policy Optimization (CoPPO), an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting.
no code implementations • NeurIPS 2021 • Yiming Gao, Bei Shi, Xueying Du, Liang Wang, Guangwei Chen, Zhenjie Lian, Fuhao Qiu, Guoan Han, Weixuan Wang, Deheng Ye, Qiang Fu, Wei Yang, Lanxiao Huang
Recently, many researchers have made successful progress in building AI systems for MOBA game playing with deep reinforcement learning, such as for Dota 2 and Honor of Kings.
1 code implementation • 9 Oct 2021 • Shiyu Huang, Wenze Chen, Longfei Zhang, Shizhen Xu, Ziyang Li, Fengming Zhu, Deheng Ye, Ting Chen, Jun Zhu
To the best of our knowledge, Tikick is the first learning-based AI system that can take over the multi-agent Google Research Football full game, while previous work could either control a single agent or experiment on toy academic scenarios.
no code implementations • 19 Jun 2021 • Hua Wei, Deheng Ye, Zhao Liu, Hao Wu, Bo Yuan, Qiang Fu, Wei Yang, Zhenhui Li
While most research focuses on the state-action function part through reducing the bootstrapping error in value function approximation induced by the distribution shift of training data, the effects of error propagation in generative modeling have been neglected.
1 code implementation • 13 May 2021 • Menghui Zhu, Minghuan Liu, Jian Shen, Zhicheng Zhang, Sheng Chen, Weinan Zhang, Deheng Ye, Yong Yu, Qiang Fu, Wei Yang
In Goal-oriented Reinforcement learning, relabeling the raw goals in past experience to provide agents with hindsight ability is a major solution to the reward sparsity problem.
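The relabeling idea referred to here is classically instantiated by hindsight experience replay: transitions from a trajectory that missed its original goal are reused by pretending the state actually reached was the goal all along. The sketch below assumes a toy transition format (dicts with `next_state`) and a binary goal-reaching reward; it illustrates the idea, not any specific paper's pipeline.

```python
def relabel_with_hindsight(episode):
    """Relabel an episode so the achieved final state becomes the goal.

    Minimal HER-style sketch: each transition gets the achieved final
    state as its goal, and reward 1.0 on the transition that reaches it.
    This turns sparse failures into useful training signal.
    """
    achieved = episode[-1]["next_state"]
    relabeled = []
    for t in episode:
        r = 1.0 if t["next_state"] == achieved else 0.0
        relabeled.append({**t, "goal": achieved, "reward": r})
    return relabeled

episode = [{"state": 0, "next_state": 1}, {"state": 1, "next_state": 2}]
print(relabel_with_hindsight(episode)[-1])  # → {'state': 1, 'next_state': 2, 'goal': 2, 'reward': 1.0}
```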
no code implementations • 5 Jan 2021 • Jiamou Sun, Zhenchang Xing, Hao Guo, Deheng Ye, Xiaohong Li, Xiwei Xu, Liming Zhu
The extracted aspects from an ExploitDB post are then composed into a CVE description according to the suggested CVE description templates; this description is required information when requesting new CVEs.
no code implementations • 18 Dec 2020 • Sheng Chen, Menghui Zhu, Deheng Ye, Weinan Zhang, Qiang Fu, Wei Yang
Hero drafting is essential in MOBA game playing as it builds the team of each side and directly affects the match outcome.
no code implementations • NeurIPS 2020 • Deheng Ye, Guibin Chen, Wen Zhang, Sheng Chen, Bo Yuan, Bo Liu, Jia Chen, Zhao Liu, Fuhao Qiu, Hongsheng Yu, Yinyuting Yin, Bei Shi, Liang Wang, Tengfei Shi, Qiang Fu, Wei Yang, Lanxiao Huang, Wei Liu
However, existing work falls short in handling the raw game complexity caused by the explosion of agent combinations, i.e., lineups, when expanding the hero pool, whereas OpenAI's Dota AI limits play to a pool of only 17 heroes.
no code implementations • 25 Nov 2020 • Deheng Ye, Guibin Chen, Peilin Zhao, Fuhao Qiu, Bo Yuan, Wen Zhang, Sheng Chen, Mingfei Sun, Xiaoqian Li, Siqin Li, Jing Liang, Zhenjie Lian, Bei Shi, Liang Wang, Tengfei Shi, Qiang Fu, Wei Yang, Lanxiao Huang
Unlike prior attempts, we integrate the macro-strategy and the micromanagement of MOBA-game-playing into neural networks in a supervised and end-to-end manner.
2 code implementations • IJCAI 2020 • Ke Xu, Yifan Zhang, Deheng Ye, Peilin Zhao, Mingkui Tan
One of the key issues is how to represent the non-stationary price series of assets in a portfolio, which is important for portfolio decisions.
no code implementations • 20 Dec 2019 • Deheng Ye, Zhao Liu, Mingfei Sun, Bei Shi, Peilin Zhao, Hao Wu, Hongsheng Yu, Shaojie Yang, Xipeng Wu, Qingwei Guo, Qiaobo Chen, Yinyuting Yin, Hao Zhang, Tengfei Shi, Liang Wang, Qiang Fu, Wei Yang, Lanxiao Huang
We study the reinforcement learning problem of complex action control in the Multi-player Online Battle Arena (MOBA) 1v1 games.