Logo Logo
Switch language to English
Liu, Feng (2011): Supporting IT Service Fault Recovery with an Automated Planning Method. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik



Despite advances in software and hardware technologies, faults are still inevitable in a highly-dependent, human-engineered and administrated IT environment. Given the critical role of IT services today, it is imperative that faults, having once occurred, have to be dealt with eciently and eeffectively to avoid or reduce the actual losses. Nevertheless, the complexities of current IT services, e.g., with regard to their scales, heterogeneity and highly dynamic infrastructures, make the recovery operation a challenging task for operators. Such complexities will eventually outgrow the human capability to manage them. Such diculty is augmented by the fact that there are few well-devised methods available to support fault recovery. To tackle this issue, this thesis aims at providing a computer-aided approach to assist operators with fault recovery planning and, consequently, to increase the eciency of recovery activities.We propose a generic framework based on the automated planning theory to generate plans for recoveries of IT services. At the heart of the framework is a planning component. Assisted by the other participants in the framework, the planning component aggregates the relevant information and computes recovery steps accordingly. The main idea behind the planning component is to sustain the planning operations with automated planning techniques, which is one of the research fields of articial intelligence. Provided with a general planning model, we show theoretically that the service fault recovery problem can be indeed solved by automated planning techniques. The relationship between a planning problem and a fault recovery problem is shown by means of reduction between these problems. After an extensive investigation, we choose a planning paradigm that based on Hierarchical Task Networks (HTN) as the guideline for the design of our main planning algorithm called H2MAP. To sustain the operation of the planner, a set of components revolving around the planning component is provided. These components are responsible for tasks such as translation between dierent knowledge formats, persistent storage of planning knowledge and communication with external systems. To ensure extendibility in our design, we apply dierent design patterns for the components. We sketch and discuss the technical aspects of implementations of the core components. Finally, as proof of the concept, the framework is instantiated to two distinguishing application scenarios.