Tuesday, August 31, 2010

Fault Tolerant Design Methodology

In my recent post, Fault Tolerant Management, I discussed a project proposal with the objective of defining a methodology for engineering stable and reliable self-* management systems. One of the inspirations for this came from an interview with Robert S. Hanmer that I heard a few years ago on Software Engineering Radio, Episode 77: Fault Tolerance with Bob Hanmer Part 1. There had been a lot of previous research covering various aspects of self-* management systems. Telecoms systems usually have high availability requirements; however, this is not necessarily the case for the systems that manage them. It is reasonable to expect that a self-* management system would be capable of operating independently, making decisions, and detecting and recovering from failures, both in itself and in the network it's managing.

One aspect of the project was to look at engineering techniques to accelerate migration from the typical legacy centralised management systems toward fault-tolerant self-* management systems. As I worked as a System Engineer designing management systems for mobile networks, I was very interested in what Bob had to say. His book Patterns for Fault Tolerant Software, which I'm reading at the moment, identifies patterns for each of the four phases of fault tolerance:
  • Error Detection
  • Error Processing including Error Recovery
  • Error Mitigation
  • Fault Treatment
Applying these patterns for the engineering of a self-* management system was the main theme of the research project.
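
To make the four phases a little more concrete, here is a minimal sketch of how they might line up in a single polling routine of a management system. This is my own simplified reading, not code from the book: the function poll_element, the read_counter callable, the last_good cache, the suspect set and the fault_log list are all hypothetical names invented for illustration.

```python
from enum import Enum


class Phase(Enum):
    ERROR_DETECTION = 1
    ERROR_PROCESSING = 2   # including error recovery
    ERROR_MITIGATION = 3
    FAULT_TREATMENT = 4


def poll_element(name, read_counter, last_good, suspect, fault_log):
    """Poll one (hypothetical) network element.

    Returns (value, phases), where phases lists the fault tolerance
    phases that were exercised during this poll.
    """
    phases = []
    try:
        value = read_counter()
        if value < 0:  # plausibility check on the reported counter
            raise ValueError(f"implausible counter value: {value}")
    except Exception as exc:
        # Error detection: the failed read or failed check is the detected error.
        phases.append(Phase.ERROR_DETECTION)
        # Error processing/recovery: fall back to the last known good value.
        value = last_good.get(name, 0)
        phases.append(Phase.ERROR_PROCESSING)
        # Error mitigation: flag the element so other logic can treat it with care.
        suspect.add(name)
        phases.append(Phase.ERROR_MITIGATION)
        # Fault treatment: record the fault so the underlying cause can be repaired.
        fault_log.append((name, repr(exc)))
        phases.append(Phase.FAULT_TREATMENT)
        return value, phases
    last_good[name] = value  # normal path: remember the good value
    return value, phases
```

A real system would of course spread these responsibilities across many components; the point of the sketch is only that each phase answers a different question (did something go wrong, how do we carry on, how do we limit the damage, how do we stop it recurring).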

Chapter 2 of the book introduces the Fault Tolerant Mindset: throughout the architecture design you should always be asking what can go wrong. That is not always easy for large systems that must perform a multitude of tasks concurrently. Chapter 2 concludes by outlining a fault tolerant design methodology. I'm finding the book very interesting and thought-provoking so far and am looking forward to completing the rest of it.
