We show that this scheme is provably efficient in the tabular setting and extend it to the deep RL setting.
We use an indexed value function to represent uncertainty in our action-value estimates.
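To make the idea of an indexed value function concrete, here is a minimal tabular sketch in which the index selects one member of an ensemble of randomly initialized Q-tables, so disagreement across indices stands in for uncertainty in the action-value estimates; the class name, ensemble size, and hyperparameters are illustrative assumptions, not the exact construction.

```python
import numpy as np

class IndexedQ:
    """Tabular sketch of an indexed value function: the index z selects one
    member of an ensemble of Q-tables, and disagreement across indices
    represents uncertainty in the action-value estimates."""

    def __init__(self, n_states, n_actions, n_indices=10, lr=0.1, gamma=0.99, seed=0):
        self.rng = np.random.default_rng(seed)
        # Random initialization gives the indices diverse (uncertain) estimates.
        self.q = self.rng.normal(0.0, 1.0, size=(n_indices, n_states, n_actions))
        self.lr, self.gamma = lr, gamma

    def sample_index(self):
        # One index held fixed per episode, as in index/posterior-sampling schemes.
        return int(self.rng.integers(self.q.shape[0]))

    def act(self, z, s):
        return int(np.argmax(self.q[z, s]))

    def update(self, z, s, a, r, s_next, done):
        target = r + (0.0 if done else self.gamma * self.q[z, s_next].max())
        self.q[z, s, a] += self.lr * (target - self.q[z, s, a])

    def uncertainty(self, s, a):
        # Spread across indices as a proxy for epistemic uncertainty.
        return float(self.q[:, s, a].std())
```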
To the best of our knowledge, ours is the first neural network-based contextual bandit algorithm with a near-optimal regret guarantee.
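For intuition only, the sketch below runs a UCB contextual bandit on fixed random nonlinear features, a crude stand-in for a learned neural representation; a real neural bandit would train the network online and build its confidence sets from network gradients, so the feature map, hyperparameters, and full-reward-table interface here are all illustrative assumptions.

```python
import numpy as np

def neural_ucb_sketch(contexts, rewards, d_feat=32, alpha=1.0, lam=1.0, seed=0):
    """UCB on fixed random nonlinear features (illustrative stand-in for a
    neural representation). contexts: (T, n_arms, d); rewards: (T, n_arms),
    given as a full table purely for simplicity of the sketch."""
    rng = np.random.default_rng(seed)
    T, n_arms, d = contexts.shape
    W = rng.normal(size=(d, d_feat)) / np.sqrt(d)   # random feature map (assumption)
    A = lam * np.eye(d_feat)                        # ridge-regression statistics
    b = np.zeros(d_feat)
    total = 0.0
    for t in range(T):
        phi = np.tanh(contexts[t] @ W)              # (n_arms, d_feat) features
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b
        # Optimism: predicted reward plus a confidence-width bonus per arm.
        ucb = phi @ theta + alpha * np.sqrt(np.einsum('ij,jk,ik->i', phi, A_inv, phi))
        a = int(np.argmax(ucb))
        A += np.outer(phi[a], phi[a])
        b += rewards[t, a] * phi[a]
        total += rewards[t, a]
    return total
```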
For cost reduction, we developed and experimentally validated two approaches: using scaled-up big data jobs as proxies for the objective function of larger jobs, and using a dynamic job similarity measure to infer that results obtained for one kind of big data problem will carry over to similar problems.
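One way to read these two ideas in code is the toy sketch below; the `run_proxy` callable, the scale factor, and the cosine-similarity measure are illustrative assumptions, not the system's actual machinery.

```python
import numpy as np

def proxy_objective(run_proxy, scale_factor):
    """Approach 1 (sketch): evaluate a cheap, scaled version of the job and
    extrapolate its measured cost to the full-size job."""
    return run_proxy() * scale_factor

def similar_enough(feat_a, feat_b, threshold=0.9):
    """Approach 2 (sketch): cosine similarity between job feature vectors as
    one possible 'job similarity measure'; if two jobs look alike, reuse the
    configuration tuned for the first on the second."""
    cos = feat_a @ feat_b / (np.linalg.norm(feat_a) * np.linalg.norm(feat_b))
    return cos >= threshold
```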
We introduce entropy-based exploration (EBE), which enables an agent to efficiently explore the unexplored regions of the state space.
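As a minimal sketch of one way an entropy-based exploration signal can be computed, the code below scores a state by how much visiting it would increase the entropy of the empirical state-visitation distribution; the exact EBE formulation may differ, so treat this bonus as illustrative.

```python
import numpy as np

def visitation_entropy(counts):
    """Shannon entropy of the empirical state-visitation distribution.
    Assumes counts contains at least one visit."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def entropy_gain_bonus(counts, s):
    """Increase in visitation entropy that visiting state s would cause.
    Under-visited states yield positive gains; over-visited ones negative."""
    before = visitation_entropy(counts)
    after_counts = counts.copy()
    after_counts[s] += 1
    return visitation_entropy(after_counts) - before
```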
Our method combines elements from distributional reinforcement learning and approximate Bayesian inference techniques with neural networks, allowing us to disentangle the two types of uncertainty (aleatoric and epistemic) in the expected return of a policy.
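A minimal sketch of how such a disentanglement can be read off an ensemble of distributional critics is given below; the quantile representation and the within-member/across-member decomposition are illustrative assumptions, not necessarily the exact estimator used.

```python
import numpy as np

def disentangle_uncertainty(quantiles):
    """Separate the two uncertainties from an ensemble of distributional
    critics. quantiles: (n_ensemble, n_quantiles), each row one member's
    quantile approximation of the return distribution for a state-action pair.

    - Aleatoric: intrinsic return randomness, read off the quantile spread
      within each member (averaged over the ensemble).
    - Epistemic: disagreement between members about the mean return."""
    aleatoric = quantiles.std(axis=1).mean()   # within-member spread
    epistemic = quantiles.mean(axis=1).std()   # across-member disagreement
    return aleatoric, epistemic

# Example: 5 bootstrapped critics, 32 quantiles each.
rng = np.random.default_rng(0)
q = rng.normal(loc=rng.normal(0, 1, size=(5, 1)), scale=2.0, size=(5, 32))
print(disentangle_uncertainty(q))
```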
In our approach, we perform online probabilistic filtering of latent task variables to infer how to solve a new task from small amounts of experience.
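As a minimal illustration of online probabilistic filtering over a latent task variable, the sketch below maintains a conjugate Gaussian posterior that is updated after every observation; the scalar latent, the Gaussian likelihood, and the variable names are illustrative assumptions rather than the method's actual inference network.

```python
import numpy as np

class TaskFilter:
    """Gaussian filter over a scalar latent task variable: each observed
    signal from the new task (e.g. a reward) updates the posterior, and a
    policy could condition on the posterior mean and variance."""

    def __init__(self, prior_mean=0.0, prior_var=1.0, obs_var=0.5):
        self.mean, self.var, self.obs_var = prior_mean, prior_var, obs_var

    def update(self, y):
        # Conjugate Gaussian update: precision-weighted blend of prior and data.
        k = self.var / (self.var + self.obs_var)   # Kalman-style gain
        self.mean += k * (y - self.mean)
        self.var *= (1.0 - k)
        return self.mean, self.var

f = TaskFilter()
for y in [0.8, 1.1, 0.9]:   # small amount of experience from a new task
    belief = f.update(y)
print(belief)               # posterior (mean, variance) after three observations
```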