Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble

Offline reinforcement learning (offline RL), which aims to find an optimal policy from a previously collected static dataset, bears algorithmic difficulties due to function approximation errors on out-of-distribution (OOD) data points. To this end, offline RL algorithms adopt either a constraint or a penalty term that explicitly guides the policy to stay close to the given dataset. However, prior methods typically require accurate estimation of the behavior policy or sampling from OOD data points, which can themselves be non-trivial problems. Moreover, these methods under-utilize the generalization ability of deep neural networks and often fall into suboptimal solutions too close to the given dataset. In this work, we propose an uncertainty-based offline RL method that takes into account the confidence of the Q-value prediction and does not require any estimation or sampling of the data distribution. We show that clipped Q-learning, a technique widely used in online RL, can be leveraged to successfully penalize OOD data points with high prediction uncertainties. Surprisingly, we find that it is possible to substantially outperform existing offline RL methods on various tasks simply by increasing the number of Q-networks along with clipped Q-learning. Based on this observation, we propose an ensemble-diversified actor-critic algorithm that reduces the number of required ensemble networks down to a tenth compared to the naive ensemble, while achieving state-of-the-art performance on most of the D4RL benchmarks considered.
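The core mechanism the abstract describes can be illustrated with a minimal NumPy sketch (not the paper's implementation): the Bellman target takes the minimum over an ensemble of N next-state Q-estimates, so actions on which the critics disagree (typically OOD actions) receive a lower, more pessimistic target. The function name and the numeric values below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def clipped_q_target(next_q_values, reward, gamma=0.99):
    """Pessimistic TD target: minimum over the N critics' next-state
    Q-estimates (clipped Q-learning generalized to an N-network ensemble)."""
    return reward + gamma * float(np.min(next_q_values))

# Hand-picked deviations for five hypothetical critics (illustrative only).
z = np.array([-1.2, 0.3, 0.8, -0.5, 1.1])

# In-distribution action: critics agree, so the min barely penalizes.
in_dist = 10.0 + 0.1 * z   # min = 9.88

# OOD action: critics disagree, so the min lies far below the mean --
# the gap grows with the ensemble's standard deviation, which acts as
# the uncertainty penalty described in the abstract.
ood = 10.0 + 5.0 * z       # min = 4.0

target_in  = clipped_q_target(in_dist, reward=0.0, gamma=1.0)  # 9.88
target_ood = clipped_q_target(ood, reward=0.0, gamma=1.0)      # 4.0
```

Note that both ensembles share the same mean estimate (10.0); only their disagreement differs, yet the clipped target for the OOD action is far lower. Increasing the number of critics sharpens this effect, which is the observation behind SAC-N and EDAC.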

NeurIPS 2021

Results from the Paper


All results below are on the D4RL benchmark; the metric is the normalized average return unless noted otherwise.

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Gym | halfcheetah-random | EDAC | Normalized Average Return | 28.4 | # 1 |
| Gym | halfcheetah-random | SAC-N | Normalized Average Return | 28 | # 2 |
| Gym | halfcheetah-medium | SAC-N | Normalized Average Return | 67.5 | # 1 |
| Gym | halfcheetah-medium | EDAC | Normalized Average Return | 65.9 | # 2 |
| Gym | halfcheetah-expert | EDAC | Normalized Average Return | 106.8 | # 1 |
| Gym | halfcheetah-expert | SAC-N | Normalized Average Return | 105.2 | # 2 |
| Gym | halfcheetah-medium-expert | SAC-N | Normalized Average Return | 107.1 | # 1 |
| Gym | halfcheetah-medium-expert | EDAC | Normalized Average Return | 106.3 | # 2 |
| Gym | halfcheetah-medium-replay | SAC-N | Normalized Average Return | 63.9 | # 1 |
| Gym | halfcheetah-medium-replay | EDAC | Normalized Average Return | 61.3 | # 2 |
| Gym | halfcheetah-full-replay | EDAC | Normalized Average Return | 84.6 | # 1 |
| Gym | halfcheetah-full-replay | SAC-N | Normalized Average Return | 84.5 | # 2 |
| Adroit | pen-human | EDAC | Normalized Average Return | 52.1 | # 1 |
| Adroit | pen-human | SAC-N | Average Return | 9.5 | # 1 |
| Adroit | door-human | EDAC | Normalized Average Return | 10.7 | # 1 |
| Adroit | door-human | SAC-N | Normalized Average Return | -0.3 | # 2 |
| Adroit | hammer-human | EDAC | Normalized Average Return | 0.8 | # 1 |
| Adroit | hammer-human | SAC-N | Normalized Average Return | 0.3 | # 2 |
| Adroit | relocate-human | EDAC | Normalized Average Return | 0.1 | # 1 |
| Adroit | relocate-human | SAC-N | Normalized Average Return | -0.1 | # 2 |
| Adroit | pen-cloned | EDAC | Normalized Average Return | 68.2 | # 1 |
| Adroit | pen-cloned | SAC-N | Normalized Average Return | 64.1 | # 2 |
| Adroit | door-cloned | EDAC | Normalized Average Return | 9.6 | # 1 |
| Adroit | door-cloned | SAC-N | Normalized Average Return | -0.3 | # 2 |
| Adroit | hammer-cloned | EDAC | Normalized Average Return | 0.3 | # 1 |
| Adroit | hammer-cloned | SAC-N | Normalized Average Return | 0.2 | # 2 |
| Adroit | relocate-cloned | EDAC | Normalized Average Return | 0 | # 1 |
| Adroit | relocate-cloned | SAC-N | Normalized Average Return | 0 | # 1 |
