Improving Existing Fault Recovery Policies

NeurIPS 2009  ·  Guy Shani, Christopher Meek ·

Automated recovery from failures is a key component in the management of large data centers. Such systems typically employ a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we explain how to use data gathered from the interactions of the hand-made controller with the system, to create an optimized controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy. While our paper focuses on a specific domain, our method is applicable to other systems that use a hand-coded, imperfect controllers.

PDF Abstract


  Add Datasets introduced or used in this paper

Results from the Paper

  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.


No methods listed for this paper. Add relevant methods here