We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.
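As a rough sketch of what such a surrogate objective can look like in code (this uses the clipped form popularized by PPO; the tensor names and the clipping radius `eps` are illustrative assumptions, not code from the paper):

```python
import torch

def clipped_surrogate_loss(new_logp, old_logp, advantages, eps=0.2):
    """Clipped surrogate objective: increase the probability of
    high-advantage actions while keeping the updated policy close
    to the one that collected the data."""
    ratio = torch.exp(new_logp - old_logp)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic lower bound; negated so a minimizer performs ascent.
    return -torch.min(unclipped, clipped).mean()
```

The alternation the abstract describes is then: collect a batch of trajectories with the current policy, take several stochastic-gradient steps on this loss, and repeat.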
Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks.
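Concretely, "directly optimizing the cumulative reward" usually means following the score-function (REINFORCE) form of the policy gradient; this is the textbook estimator rather than a formula from any single paper above:

$$
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right],
$$

where $R(\tau)$ is the cumulative reward of trajectory $\tau$. Because only $\log \pi_\theta$ needs to be differentiated, any network that outputs action probabilities can serve as the policy.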
Evolution strategies (ES) are a family of black-box optimization algorithms able to train deep neural networks roughly as well as Q-learning and policy gradient methods on challenging deep reinforcement learning (RL) problems, but are much faster (e.g., hours vs. days) because they parallelize better.
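A minimal sketch of the basic ES update (the fitness function `evaluate_return`, the population size, and the step sizes are illustrative assumptions; `theta` is a flat NumPy parameter vector):

```python
import numpy as np

def es_step(theta, evaluate_return, npop=50, sigma=0.1, alpha=0.01):
    """One evolution-strategies update: probe the parameter space with
    Gaussian noise and move along the return-weighted average probe."""
    noise = np.random.randn(npop, theta.size)
    returns = np.array([evaluate_return(theta + sigma * n) for n in noise])
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Each worker only has to report a scalar return, which is why
    # the method parallelizes so well.
    grad = noise.T @ advantages / (npop * sigma)
    return theta + alpha * grad
```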
In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized.
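One common recipe for this kind of metric-level optimization is REINFORCE with a self-critical baseline, where the model's own greedy decode sets the bar; a minimal sketch (argument names are illustrative, and the rewards would come from an evaluation metric such as CIDEr):

```python
import torch

def self_critical_loss(logp_sample, reward_sample, reward_greedy):
    """Sampled captions that beat the greedy decode under the test
    metric are reinforced; weaker samples are suppressed."""
    advantage = reward_sample - reward_greedy   # per-caption scalar rewards
    return -(advantage * logp_sample).mean()    # logp_sample: summed token log-probs
```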
Recently, a variety of methods have been developed for the recommendation problem; these generally try to learn effective representations of users and items and then match items to users according to those representations.
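A minimal sketch of this represent-then-match pattern, assuming learned embedding tables and an inner-product score (the dimensions and random stand-in embeddings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 1000, 5000, 64

# Stand-ins for representations a real system would learn from interactions.
user_emb = rng.normal(size=(n_users, dim))
item_emb = rng.normal(size=(n_items, dim))

def recommend(user_id, k=10):
    """Score every item against the user's representation and
    return the indices of the k best matches."""
    scores = item_emb @ user_emb[user_id]
    return np.argsort(-scores)[:k]
```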
Recent approaches to question generation have used modifications to a Seq2Seq architecture inspired by advances in machine translation.
Recent neural models of dialogue generation offer great promise for generating responses for conversational agents, but tend to be shortsighted, predicting utterances one at a time while ignoring their influence on future outcomes.
We analyze the connection between Q-Prop and existing model-free algorithms, and use control variate theory to derive two variants of Q-Prop with conservative and aggressive adaptation.
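The control-variate idea itself is easy to see in a generic Monte Carlo setting; the sketch below illustrates the general principle rather than the Q-Prop estimator (the integrand and variate are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=100_000)

f = x**2        # estimand: E[f] = 2.0 for x ~ N(1, 1)
h = x           # control variate with known mean E[h] = 1.0
c = np.cov(f, h)
beta = c[0, 1] / c[1, 1]        # variance-minimizing coefficient

naive = f.mean()                            # plain Monte Carlo estimate
controlled = (f - beta * (h - 1.0)).mean()  # same expectation, lower variance
```

Q-Prop plays the same game inside the policy gradient, using a Taylor expansion of an off-policy critic as the variate; the conservative and aggressive variants differ in how the weighting on that variate is adapted.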
Based on the feedback signals generated during this process (e.g., the language-model likelihood of a model's output, and the reconstruction error of the original sentence after the primal and dual translations), we can iteratively update the two models until convergence (e.g., using policy gradient methods).
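Schematically, with notation reconstructed rather than quoted from the paper: if the primal model $P(y \mid x;\ \theta_{xy})$ translates $x$ into $y$ and the dual model translates back, those two feedback signals can be folded into a single reward,

$$
r(x, y) \;=\; \alpha \,\mathrm{LM}(y) \;+\; (1-\alpha)\,\log P(x \mid y;\ \theta_{yx}),
$$

and a REINFORCE-style gradient $r(x, y)\,\nabla_{\theta_{xy}} \log P(y \mid x;\ \theta_{xy})$ updates the primal model, with an analogous update for the dual; the weight $\alpha$ trades fluency against reconstruction.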