- Detection: This is the process of discovering that an error exists. Detection time is measured from the failure that causes a loss of service to the moment the system becomes aware of it. Error detection is the responsibility of every hardware and software component in the system. To meet availability goals such as 99.999 percent, adequate error detection must be designed in, or a component may not be suitable for use in highly available systems.
- Location: This is the process of narrowing down the failure to the defective component. This process depends greatly on the definition of a fault zone. At this stage of the process, it is not necessary to locate an error to a region smaller than the region that will be isolated.
- Isolation: This takes the defective portion of the system out of service. The region that is isolated must be bounded at a point where it can be removed from all interaction with the system.
- Recovery: This is the process of reassigning the necessary resources to restore the system to an operating state. Recovery also requires restoring any portions of the system that were adversely affected by the failing component. Recovery is the final step in the process that contributes to outage time. Once the system is providing complete service again, the remainder of the process does not directly contribute to outage time.
- Reporting: This is the process that notifies the outside world that an event has taken place; it is the first step in the repair process. The repair process is indirectly related to availability. In systems employing redundancy, there is a statistical possibility that a second failure can occur in the component covering for this failure, which would result in a complete system outage. While the probability is low, the severity is high enough to make this a factor in the availability equation. It is important, even in redundant systems, to keep repair times low.
- Repair: This is the replacement of the defective component and is generally the operator-assisted (human) portion of the process. This phase is treated separately because it is usually the most time-consuming portion of the process and because it is a phase where mistakes can cause system outages.
- Reintegration: Finally, the repaired component is reintegrated. Once the defective hardware or software component has been replaced, it is brought back into service either as a new standby component or as a sharer of the system load.
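The outage accounting implied by the list above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the phase names follow the list, while the `OUTAGE_PHASES` set, the helper names, and all durations are hypothetical values chosen for the example.

```python
from enum import Enum


class Phase(Enum):
    DETECTION = "detection"
    LOCATION = "location"
    ISOLATION = "isolation"
    RECOVERY = "recovery"
    REPORTING = "reporting"
    REPAIR = "repair"
    REINTEGRATION = "reintegration"


# Per the text, only the phases up to and including recovery
# contribute directly to outage time.
OUTAGE_PHASES = frozenset(
    {Phase.DETECTION, Phase.LOCATION, Phase.ISOLATION, Phase.RECOVERY}
)

MINUTES_PER_YEAR = 365 * 24 * 60


def outage_seconds(durations):
    """Sum the durations (in seconds) of the phases that count as outage."""
    return sum(sec for phase, sec in durations.items() if phase in OUTAGE_PHASES)


def downtime_budget_minutes(availability):
    """Annual downtime allowed by an availability target such as 0.99999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Hypothetical incident: the hour of repair is not counted as outage.
incident = {
    Phase.DETECTION: 2.0,
    Phase.LOCATION: 0.5,
    Phase.ISOLATION: 0.5,
    Phase.RECOVERY: 7.0,
    Phase.REPAIR: 3600.0,
}

print(outage_seconds(incident))                    # 10.0 seconds of outage
print(round(downtime_budget_minutes(0.99999), 2))  # about 5.26 minutes per year
```

Under these assumed numbers, a 99.999 percent target leaves roughly five minutes of outage per year, which is why each second spent in detection through recovery matters.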
These distinctions are somewhat arbitrary, but they nevertheless illustrate differences in system architectures.
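To make the point about repair time in redundant systems concrete, the following sketch estimates the chance that the covering component also fails before the repair is finished. It assumes a constant failure rate (an exponential failure model), which the text does not specify; the function name and the MTBF figure are hypothetical.

```python
import math


def second_failure_probability(repair_hours, mtbf_hours):
    """Probability that the surviving component fails during the repair window.

    Assumes an exponential failure model with mean time between failures
    `mtbf_hours`; this is an illustrative assumption, not a measured value.
    """
    # -expm1(-x) equals 1 - exp(-x) but is more accurate for small x.
    return -math.expm1(-repair_hours / mtbf_hours)


ASSUMED_MTBF_HOURS = 100_000.0  # hypothetical figure for the standby component

for repair_hours in (4.0, 72.0):
    p = second_failure_probability(repair_hours, ASSUMED_MTBF_HOURS)
    print(f"repair window {repair_hours:5.1f} h -> exposure {p:.2e}")
```

Under this model the exposure grows almost linearly with the repair window, so a repair that takes days rather than hours multiplies the risk of a complete outage accordingly.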
Fault Management in Clusters
In a 2N (clustered) system, which has highly encapsulated fault zones, the fault management cycle is somewhat simplified:
- Detection is crucial. Any time the system spends malfunctioning without the fault being detected is considered a direct outage. The ability of a system to detect all possible failures is measured by its fault coverage. Anything not covered is assigned a probability and factored in as a severe outage.
- Location is implied. Any fault detected is usually detected within the node itself (assuming good fault coverage), and so the location is known.
- Isolation and recovery are essentially the same step. All the activity is moved off of the failing node to its standby, thus isolating the defective node and recovering the system at the same time.
- Clustered systems essentially move from detection to recovery. Location and isolation are inherent in the architecture of the system. The complexity of the recovery process is always dependent on the specific application running on the system.
The remainder of the process proceeds off-line. Here, too, the repair process can be as simple as replacing the entire node, or the technician may choose to further locate the failed component within the node. This diagnostic is done in an off-line system with full resources available to the repair process. A node-based system is relatively immune to mistakes made during this process.
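The simplified cycle for a 2N pair can be sketched as follows. This is a minimal model under stated assumptions: the `Node` and `Pair2N` classes and their method names are hypothetical, and the application-specific recovery work is reduced to flipping an `active` flag.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True
    active: bool = False


class Pair2N:
    """Hypothetical 2N pair: one active node and one standby."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.primary.active = True

    def active_node(self):
        for node in (self.primary, self.standby):
            if node.active:
                return node
        return None

    def on_fault_detected(self, node):
        """Handle a fault reported by `node`.

        Location is implied: the fault zone is the whole node.
        Isolation and recovery happen together in the switchover.
        """
        node.healthy = False
        if node.active:
            peer = self.standby if node is self.primary else self.primary
            if not peer.healthy:
                raise RuntimeError("both nodes failed: complete outage")
            node.active = False  # isolation: failing node leaves service
            peer.active = True   # recovery: standby takes over the load
        return self.active_node()


pair = Pair2N(Node("node-a"), Node("node-b"))
survivor = pair.on_fault_detected(pair.primary)
print(survivor.name)  # node-b
```

The failed node is left marked unhealthy and inactive, which corresponds to handing it over to the off-line repair process.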
Finely Grained Fault Management
In systems where node-based (2N) configurations are not economical, devices are spared in an N+1 arrangement. This requires more finely grained fault management, and the process becomes more complicated:
- Fault detection is always the same; it must be done by every resource in the system as quickly as possible.
- Location is done with an on-line diagnostic. This diagnostic should not interfere with the operating portion of the system. The goal is to keep as much of the system operating as possible to avoid total outages. Locating a failure in an active system is complicated; failures cannot always be precisely located. A typical metric for on-line fault identification is 95 percent fault location accuracy.
- Isolation is critical. To spare on more finely grained boundaries, the system must have an infrastructure that permits isolation of individual field replaceable units (FRUs). In an N+1 system, failed components must exist benignly while system activity continues.
- Recovery is more complicated for two reasons. With nodes, a clean boundary can be placed around a node that hides the complexities of the system. In an N+1 system, there is a hierarchy of component dependencies. When a component is determined to be bad, any other system components depending upon this resource must be recovered as well (see Figure 1). This aspect of topology management becomes part of the critical path to restoring service.

Figure 1. Example of a Dependency Tree
Another difficulty of fault recovery in this type of system is a result of incomplete encapsulation of faults. A failing device may push other devices into error states. After a defective device has been identified, all other error conditions must be cleaned up.
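The dependency-driven part of recovery can be sketched as a graph traversal: given the component judged to be bad, collect everything that depends on it, directly or indirectly. The topology below is hypothetical and is not a reproduction of Figure 1; the function and component names are likewise assumptions made for illustration.

```python
from collections import deque


def affected_components(depends_on, failed):
    """Return components that depend on `failed`, nearest dependents first.

    `depends_on` maps each component to the components it requires.
    """
    # Invert "X depends on Y" into "Y is depended on by X".
    dependents = {}
    for component, requirements in depends_on.items():
        for requirement in requirements:
            dependents.setdefault(requirement, []).append(component)

    # Breadth-first walk outward from the failed component.
    order = []
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical N+1 topology.
topology = {
    "io_controller": ["bus_bridge"],
    "net_port": ["bus_bridge"],
    "disk_0": ["io_controller"],
    "disk_1": ["io_controller"],
}

print(affected_components(topology, "io_controller"))  # ['disk_0', 'disk_1']
print(affected_components(topology, "bus_bridge"))
# ['io_controller', 'net_port', 'disk_0', 'disk_1']
```

The same list can serve as the set of devices whose secondary error states need to be cleaned up once the defective device has been identified.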
Reporting is application-dependent, but with more finely grained FRUs, a more detailed method of indication is needed. Reporting becomes crucial to the repair process. Because an operator or technician is repairing a portion of a live system, it is essential to correctly identify the specific FRU to be replaced. Operator errors account for a significant loss of service in these types of systems. A cleanly designed reporting mechanism can minimize these mistakes.
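One way to reduce ambiguity for the operator is to make every report name exactly one FRU and state whether it is safe to pull. The record below is a minimal sketch; the `FruReport` class, its fields, and the sample values are all hypothetical and do not describe any particular product's reporting format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FruReport:
    fru_type: str         # kind of unit, e.g. a fan tray or I/O board
    chassis: int          # physical chassis number
    slot: int             # physical slot within the chassis
    serial: str           # serial number printed on the unit
    fault: str            # short description of the detected fault
    safe_to_remove: bool  # True only after the FRU has been isolated

    def operator_message(self):
        """Render an instruction that identifies one specific FRU."""
        action = "REPLACE" if self.safe_to_remove else "DO NOT REMOVE YET"
        return (
            f"{action}: {self.fru_type} in chassis {self.chassis}, "
            f"slot {self.slot} (serial {self.serial}): {self.fault}"
        )


# Hypothetical report for an isolated unit.
report = FruReport("fan tray", 1, 3, "SN-EXAMPLE-0001", "rotor stalled", True)
print(report.operator_message())
```

Including both the physical position and the serial number gives the technician two independent ways to confirm that the unit in hand is the one that failed.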
Repair, in a live system, requires some form of hot replacement. Again, a system must be designed to support this activity. Another consideration for repair of an active system is to ensure that FRUs do not mechanically interact. It is not efficient to be forced to remove one component to get to another.
The final phase of reintegration is again slightly more complicated because it is taking place in a live system. The fault management cycle in N+1 systems requires much more processing from the central processing unit (CPU) responsible for managing the system. If that CPU is also involved in system activity, there must be reserve processing capacity for fault management.


