Tuesday, November 3, 2009

Armstrong chapter 5: Fault Tolerance

The first thing that jumped out at me while reading this chapter was the idea of decomposing tasks into simpler and simpler pieces so that correct, or at least safe, execution can be achieved in all cases. I originally misinterpreted this as performing only some subtasks of the original task in order to complete as much as possible before exiting. That would of course be very dangerous: at an online store, for example, the money could be subtracted from the account without the items being credited to the user, or the other way around. As I read on, it seems that the proposed system actually focuses on two related areas: recovery and dependence. First, if a task fails to execute, we either restart it in the hope that it completes correctly on a new run, or run an alternative task that restores the system to the correct state from before the failed task was run. This should keep the system from being left in an unstable state, since we can explicitly design the hierarchy to recover from errors in the input or the execution.
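
To make the restart-or-alternative idea concrete for myself, here is a minimal Python sketch. It is only my illustration, not code from the thesis: `run_with_recovery`, the `max_restarts` limit, and the snapshot-based rollback are all assumptions I made up for the example.

```python
import copy
from typing import Callable, Dict

State = Dict[str, int]  # hypothetical shared state, e.g. an account


def run_with_recovery(
    state: State,
    task: Callable[[State], None],
    alternative: Callable[[State], None],
    max_restarts: int = 2,
) -> str:
    """Run `task`; on failure roll back and restart it, then fall back.

    Assumption: a failed attempt may have changed `state` halfway, so the
    state is restored from a snapshot before every restart.
    """
    snapshot = copy.deepcopy(state)
    for _ in range(1 + max_restarts):
        try:
            task(state)
            return "completed"
        except Exception:
            # Undo any partial work so no attempt leaves half a purchase behind.
            state.clear()
            state.update(copy.deepcopy(snapshot))
    # Every attempt failed: run the simpler alternative on the restored state.
    alternative(state)
    return "recovered"


def failing_purchase(state: State) -> None:
    """Debits the money, then fails before the items are credited."""
    state["balance"] -= 30
    raise RuntimeError("item service unavailable")


def cancel_order(state: State) -> None:
    """Alternative task: record the cancellation and touch nothing else."""
    state["cancelled"] = 1


account = {"balance": 100, "items": 0}
outcome = run_with_recovery(account, failing_purchase, cancel_order)
```

In this run the purchase fails on every attempt, so the balance ends up back at 100 and the order is marked as cancelled, which is exactly the online store case I was worried about above.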

Reading further, it seems that the greatest use of this And/Or supervisor technique is to create a hierarchy that guarantees tasks execute properly and in a specified order (achieved through careful construction of the supervisor tree). This still serves the chapter's overall goal of designing fault-tolerant software: with the correct Ands and Ors, we can build a tree that prevents the program from getting into a state that causes problems because of unfulfilled preconditions. Building on this structure will take careful management and a mindful hand, but in systems that must correctly handle all adverse conditions of execution and input, designing for safety already needs to be at the forefront of the designer's mind. It is much easier to design a system around this approach from the ground up than to go back and reconfigure everything to fit it. There is even a chance to design some parallelism into the system: if the subtasks are all dependence-related rather than fault-recovery ones, we can create new threads or processes to execute them efficiently.
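
Here is a rough Python sketch of how I picture such a tree. It follows my reading of the And/Or idea as a task tree, not the actual supervisor behaviour in Erlang/OTP, and the names `Leaf`, `And`, `Or`, and `run` are hypothetical. An `And` node runs its children in order and stops at the first failure, so a later task never runs with an unfulfilled precondition. An `Or` node tries its children in order until one succeeds, which gives the alternative-task behaviour.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Leaf:
    name: str
    action: Callable[[], bool]  # returns True on success


@dataclass
class And:
    children: List["Node"]  # every child must succeed, in order


@dataclass
class Or:
    children: List["Node"]  # the first child to succeed is enough


Node = Union[Leaf, And, Or]


def run(node: Node, log: List[str]) -> bool:
    """Execute the tree depth-first and record what each leaf did."""
    if isinstance(node, Leaf):
        try:
            ok = bool(node.action())
        except Exception:
            ok = False  # a crash counts as a failed task
        log.append(f"{node.name}:{'ok' if ok else 'fail'}")
        return ok
    if isinstance(node, And):
        # all() stops at the first failing child, so later tasks are skipped.
        return all(run(child, log) for child in node.children)
    # any() stops at the first succeeding child, so later alternatives are skipped.
    return any(run(child, log) for child in node.children)


order = And([
    Leaf("validate", lambda: True),
    Or([
        Leaf("primary_payment", lambda: False),
        Leaf("backup_payment", lambda: True),
    ]),
    Leaf("ship", lambda: True),
])

trace: List[str] = []
succeeded = run(order, trace)
```

With these made-up tasks, the primary payment fails, the backup payment takes over, and shipping only runs because both validation and some form of payment succeeded first.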
