CASPI: Causal-aware Safe Policy Improvement for Task-oriented Dialogue

The recent success of reinforcement learning (RL) in solving complex tasks is often attributed to its capacity to explore and exploit an environment. Sample efficiency is usually not an issue for tasks with cheap simulators from which data can be sampled online. On the other hand, task-oriented dialogue (ToD) systems are usually learnt from offline data collected through human demonstrations, and collecting diverse demonstrations and annotating them is expensive. Unfortunately, RL policies trained on off-policy data are prone to issues of bias and poor generalization, which are further exacerbated by the stochasticity of human responses and the non-Markovian nature of the annotated belief state of a dialogue management system. To this end, we propose a batch-RL framework for ToD policy learning: Causal-aware Safe Policy Improvement (CASPI). CASPI includes a mechanism to learn a fine-grained reward that captures the intention behind human responses, and it offers a guarantee on the dialogue policy's performance against a baseline. We demonstrate the effectiveness of this framework on the end-to-end dialogue task of the MultiWOZ 2.0 dataset. The proposed method outperforms the current state of the art. Furthermore, we demonstrate sample efficiency: our method trained on only 20% of the data is comparable to the current state-of-the-art method trained on 100% of the data on two out of three evaluation metrics.
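To make the two ingredients of the framework concrete, the sketch below is a minimal, hypothetical illustration (not the authors' code or the CASPI algorithm itself): a turn-level reward model is fit from pairwise preferences over logged dialogue turns with a Bradley-Terry ranking loss, and the learned reward is then used to re-weight a behaviour-cloning-style policy update on the same offline data, a conservative stand-in for the safe policy improvement step. All dimensions, data, and model names are toy placeholders.

# Hypothetical sketch: preference-based reward learning + reward-weighted
# behaviour cloning on offline dialogue data. Not the paper's implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)
STATE_DIM, NUM_ACTIONS, HIDDEN = 32, 10, 64

class RewardModel(nn.Module):
    """Scores a (belief-state, action) pair with a scalar reward."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + NUM_ACTIONS, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1))
    def forward(self, state, action_onehot):
        return self.net(torch.cat([state, action_onehot], dim=-1)).squeeze(-1)

class Policy(nn.Module):
    """Maps a belief state to logits over dialogue acts."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, NUM_ACTIONS))
    def forward(self, state):
        return self.net(state)

# Toy offline data: belief states, logged actions, and preference pairs
# (turn i judged better than turn j, e.g. from dialogue success signals).
N = 256
states = torch.randn(N, STATE_DIM)
actions = torch.randint(0, NUM_ACTIONS, (N,))
actions_onehot = nn.functional.one_hot(actions, NUM_ACTIONS).float()
pref_i = torch.randint(0, N, (128,))
pref_j = torch.randint(0, N, (128,))

# 1) Fit the reward model with a Bradley-Terry pairwise ranking loss.
reward_model = RewardModel()
opt_r = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
for _ in range(200):
    r_i = reward_model(states[pref_i], actions_onehot[pref_i])
    r_j = reward_model(states[pref_j], actions_onehot[pref_j])
    loss_r = -torch.log(torch.sigmoid(r_i - r_j) + 1e-8).mean()
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()

# 2) Reward-weighted behaviour cloning: the policy only re-weights logged
#    actions, so it stays close to the demonstration data (a conservative,
#    offline-safe update rather than unconstrained off-policy RL).
policy = Policy()
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)
with torch.no_grad():
    weights = torch.softmax(reward_model(states, actions_onehot), dim=0) * N
for _ in range(200):
    logp = nn.functional.log_softmax(policy(states), dim=-1)
    loss_pi = -(weights * logp[torch.arange(N), actions]).mean()
    opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()

print("learned reward range:",
      reward_model(states, actions_onehot).min().item(),
      reward_model(states, actions_onehot).max().item())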
