no code implementations • 26 Sep 2020 • Gavin Rens, Jean-François Raskin, Raphaël Reynouard, Giuseppe Marra
In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves and a Mealy machine synchronized with this MDP to formalize the non-Markovian reward function.
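As a rough illustration of this setting, the following minimal sketch pairs a small MDP with a Mealy machine that emits rewards based on the history of visited states. Everything in it is a hypothetical example rather than the authors' construction: the class names `MDP` and `MealyRewardMachine`, the state labels, the toy "fetch the key, then reach the door" task, and the assumption that the machine reads a label attached to each MDP state.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    # (state, action) -> list of (next_state, probability)
    transitions: dict
    # state -> label read by the reward machine (an assumption of this sketch)
    labels: dict

    def step(self, state, action, rng):
        successors = self.transitions[(state, action)]
        states = [s for s, _ in successors]
        probs = [p for _, p in successors]
        return rng.choices(states, weights=probs, k=1)[0]


@dataclass
class MealyRewardMachine:
    initial: str
    # (node, label) -> (next_node, reward); the reward is the Mealy output
    delta: dict

    def step(self, node, label):
        return self.delta[(node, label)]


def run_episode(mdp, machine, start, policy, horizon, rng):
    """Run the MDP and the Mealy machine in lockstep and sum the rewards."""
    state, node = start, machine.initial
    total = 0.0
    for _ in range(horizon):
        action = policy(state, node)
        state = mdp.step(state, action, rng)
        # The machine advances on the label of the state just reached.
        node, reward = machine.step(node, mdp.labels[state])
        total += reward
    return total


# Hypothetical toy task: reaching the door pays off only after visiting the key.
STATES = ["home", "key", "door"]
toy_mdp = MDP(
    transitions={(s, "to_" + t): [(t, 1.0)] for s in STATES for t in STATES},
    labels={s: s for s in STATES},
)
toy_machine = MealyRewardMachine(
    initial="no_key",
    delta={
        ("no_key", "home"): ("no_key", 0.0),
        ("no_key", "key"): ("has_key", 0.0),
        ("no_key", "door"): ("no_key", 0.0),
        ("has_key", "home"): ("has_key", 0.0),
        ("has_key", "key"): ("has_key", 0.0),
        ("has_key", "door"): ("no_key", 1.0),  # reward, then reset
    },
)


def key_then_door(state, node):
    # This policy needs the machine node: the MDP state alone is not enough.
    return "to_key" if node == "no_key" else "to_door"
```

The reward for entering `door` depends on whether `key` was visited earlier, so it is not a function of the current MDP state and action alone. It becomes Markovian only over the product of MDP state and machine node, which is why the policy above takes both as arguments.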