International Engineering Consortium
Web ProForums
Highly Available Embedded Computer Platforms Become Reality

5. Fault Management Software for Redundant Systems
Some basic concepts apply to fault management in all forms of redundant systems. Again, the complexity of this software depends on the architecture and the restart model.

Distributed Data Environment (DDE)

This is used to create a virtual environment for the applications and drivers. A DDE is the primary mechanism to support failover of hardware or software components. If a resource fails, the DDE can redirect traffic to the standby component without the application knowing about it.
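The redirection idea can be illustrated with a minimal sketch. Everything here is hypothetical (the `ResourceProxy` class, the two card functions, and the use of `ConnectionError` to represent a failed resource); it is not the interface of any particular DDE product. The application calls `send` and never learns which component served the request.

```python
class ResourceProxy:
    """Hypothetical handle that hides which component serves a request."""

    def __init__(self, active, standby):
        self._active = active
        self._standby = standby

    def send(self, message):
        try:
            return self._active(message)
        except ConnectionError:
            if self._standby is None:
                raise  # no spare left, so the fault must be surfaced
            # Fail over: promote the standby and retry the request once.
            self._active, self._standby = self._standby, None
            return self._active(message)


def failed_card(message):
    # Stands in for a resource that has stopped responding.
    raise ConnectionError("card A is down")


def healthy_card(message):
    return f"card B handled {message}"


proxy = ResourceProxy(active=failed_card, standby=healthy_card)
result = proxy.send("frame-1")
```

In this sketch the caller sees only the return value of `send`; the switch to the standby happens inside the proxy.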

Topology management is a resource that keeps track of system configuration. Part of its role is to identify where the redundant resources reside. In 2N systems, topology management can be as simple as keeping track of which system is active and which is standby. In larger clusters of systems, topology management may be delegated to a specific node in the cluster.

In N+1 systems, topology management must keep track of all the dependencies in the system. Topology management becomes the responsibility of each system and can even be needed in individual subsystems in a larger system. Because the fault-management process requires extensive and immediate access to the system topology, topology management is a key component.

While dissimilar spare devices are possible in either type of system, they are more prevalent in N+1 systems. Topology management is more complex in N+1 systems and therefore extends more readily to heterogeneous configurations. Systems that take advantage of the simplicity of a 2N topology usually support only identical spares.
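A small sketch may help show why dependency tracking matters in an N+1 system. The `Topology` class, the component names, and the role strings below are all assumptions made for illustration. The sketch records each component's role and dependencies, finds everything affected by a failure, and promotes the shared spare.

```python
class Topology:
    """Hypothetical record of component roles and dependencies."""

    def __init__(self):
        self.role = {}        # component -> "active", "standby", or "failed"
        self.depends_on = {}  # component -> set of components it needs

    def add(self, name, role, depends_on=()):
        self.role[name] = role
        self.depends_on[name] = set(depends_on)

    def affected_by(self, failed):
        """Return every component that depends on `failed`, directly or not."""
        affected, frontier = set(), [failed]
        while frontier:
            current = frontier.pop()
            for name, deps in self.depends_on.items():
                if current in deps and name not in affected:
                    affected.add(name)
                    frontier.append(name)
        return affected

    def fail_over(self, failed):
        """Mark `failed` as failed and promote the first standby, if any."""
        self.role[failed] = "failed"
        for name, role in self.role.items():
            if role == "standby":
                self.role[name] = "active"
                # Repoint dependents from the failed component to the spare.
                for deps in self.depends_on.values():
                    if failed in deps:
                        deps.discard(failed)
                        deps.add(name)
                return name
        return None


topo = Topology()
topo.add("line-card-1", "active")
topo.add("line-card-2", "active")
topo.add("spare-card", "standby")  # the single spare of an N+1 arrangement
topo.add("routing-app", "active", depends_on=["line-card-1"])
topo.add("billing-app", "active", depends_on=["routing-app"])

impacted = topo.affected_by("line-card-1")
replacement = topo.fail_over("line-card-1")
```

A 2N system would need only the `role` table with two entries; the dependency walk is what the N+1 case adds.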

Event Management

This is the focal point for handling exceptions. The event manager receives messages from all portions of the system and performs the necessary system management functions. These operations may be part of normal system operation, as in a system upgrade, or part of a fault management cycle, as in a system exception.

In systems with a fault management cycle as simple as turning over control to the standby, event management might not even be identified as a separate process. In systems with complex fault management, this service can be as sophisticated as a highly customizable rules-based interpreter.
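As a rough illustration of the rules-based end of that range, the sketch below pairs a predicate with an action and runs the first rule that matches an incoming event. The `EventManager` class, the event fields (`type`, `severity`, `source`), and the action strings are all invented for this example.

```python
class EventManager:
    """Hypothetical first-match rule dispatcher for system events."""

    def __init__(self):
        self._rules = []  # (predicate, action) pairs in priority order

    def add_rule(self, predicate, action):
        self._rules.append((predicate, action))

    def handle(self, event):
        """Run the action of the first rule whose predicate matches."""
        for predicate, action in self._rules:
            if predicate(event):
                return action(event)
        return "ignored"


manager = EventManager()
# Severe faults trigger a failover; lesser faults only a restart.
manager.add_rule(lambda e: e["type"] == "fault" and e["severity"] >= 3,
                 lambda e: f"fail over {e['source']}")
manager.add_rule(lambda e: e["type"] == "fault",
                 lambda e: f"restart {e['source']}")
# Planned operations such as upgrades pass through the same manager.
manager.add_rule(lambda e: e["type"] == "upgrade",
                 lambda e: f"switch {e['source']} to standby for upgrade")

severe = manager.handle({"type": "fault", "severity": 4, "source": "cpu-a"})
minor = manager.handle({"type": "fault", "severity": 1, "source": "cpu-a"})
planned = manager.handle({"type": "upgrade", "source": "cpu-b"})
```

Because the rules are data rather than code paths, a system integrator could reorder or replace them without touching the dispatcher.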

Application Management

In addition to these core services, more complex forms of fault management need other services. One of these is application management. The application manager provides the system with a means to monitor the health of applications and signal exceptions. Because this is a system service, applications can be written to a standard application programming interface (API) and do not have to deal with the underlying fault management infrastructure.
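One way to picture such an API is sketched below. The `ApplicationManager` class, its `register` and `poll` methods, and the application names are hypothetical. Each application supplies only a health probe, and the manager signals an exception through a callback when a probe fails or crashes.

```python
class ApplicationManager:
    """Hypothetical health-monitoring service for registered applications."""

    def __init__(self, on_exception):
        self._checks = {}
        self._on_exception = on_exception  # called with the application name

    def register(self, name, health_check):
        self._checks[name] = health_check

    def poll(self):
        """Run every health check once and signal each unhealthy application."""
        unhealthy = []
        for name, check in self._checks.items():
            try:
                ok = check()
            except Exception:
                ok = False  # a crashing probe counts as unhealthy
            if not ok:
                unhealthy.append(name)
                self._on_exception(name)
        return unhealthy


def crashing_probe():
    raise RuntimeError("probe crashed")


signaled = []
apps = ApplicationManager(on_exception=signaled.append)
apps.register("call-control", lambda: True)
apps.register("billing", lambda: False)
apps.register("logging", crashing_probe)

unhealthy = apps.poll()
```

In a fuller system the callback would presumably hand the exception to the event manager rather than append to a list.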

Checkpoint Service

This provides one or more channels to save state data between active and standby components. The checkpoint service can abstract the interaction between a component and its standby, so the application does not need to know about the system architecture or the restart model. In a warm restart implementation, the checkpoint service saves the state information with the system until a standby is needed. In a hot restart implementation, the data is sent immediately to the standby.

Checkpointing can be performed in several ways. In active checkpointing, the application notifies the service when data is ready to be copied. In passive checkpointing, the application registers a data structure with the service, and the service asynchronously collects the information at specific intervals. Either technique can be useful, depending upon the application.
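The difference between the two styles can be sketched as follows. The `CheckpointService` class and its method names are assumptions, and a plain dictionary stands in for the standby's copy of the state. The interval timer is omitted; in this sketch, `collect` is what such a timer would call.

```python
import copy


class CheckpointService:
    """Hypothetical service supporting active and passive checkpointing."""

    def __init__(self):
        self.standby_store = {}  # stands in for state held for the standby
        self._registered = {}

    def checkpoint(self, key, data):
        """Active: the application pushes data when it is ready to be copied."""
        self.standby_store[key] = copy.deepcopy(data)

    def register(self, key, data_structure):
        """Passive: the application registers a live data structure once."""
        self._registered[key] = data_structure

    def collect(self):
        """Passive: snapshot every registered structure (run at each interval)."""
        for key, data in self._registered.items():
            self.standby_store[key] = copy.deepcopy(data)


service = CheckpointService()

# Active checkpointing: the application decides when to save.
service.checkpoint("call-42", {"state": "connected"})

# Passive checkpointing: the service picks up changes on its own schedule.
sessions = {"count": 1}
service.register("sessions", sessions)
sessions["count"] = 2
service.collect()
sessions["count"] = 3  # not visible to the standby until the next collect
```

The trade-off shows in the last line: passive checkpointing asks less of the application, but the standby can lag by up to one interval.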

Heartbeat Protocol

This is used as a basic fault-detection mechanism. Components in a highly available system use the heartbeat protocol to signal that they are still functioning. If a component fails to check in at the appropriate interval, corrective action can be taken. The heartbeat protocol can also be tied in with the checkpoint service, for example by synchronizing the transport of state data with an application’s heartbeat.
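A minimal sketch of the detection side follows. The `HeartbeatMonitor` class, the component names, and the three-second timeout are illustrative assumptions. Time is passed in explicitly so the example is deterministic; a real monitor would read a clock.

```python
class HeartbeatMonitor:
    """Hypothetical monitor that flags components with overdue heartbeats."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds allowed between heartbeats
        self._last_seen = {}

    def beat(self, component, now):
        """Record that `component` checked in at time `now`."""
        self._last_seen[component] = now

    def expired(self, now):
        """Return the components whose last heartbeat is older than the timeout."""
        return sorted(name for name, seen in self._last_seen.items()
                      if now - seen > self.timeout)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.beat("cpu-a", now=0.0)
monitor.beat("cpu-b", now=0.0)
monitor.beat("cpu-a", now=2.0)
monitor.beat("cpu-a", now=4.0)  # cpu-b has stopped checking in

late = monitor.expired(now=4.0)
```

Here `late` holds only `cpu-b`, which is the point at which corrective action would be taken.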

On-Line and Off-Line Diagnostics

These play an important role in a highly available system. A diagnostic manager schedules independent activities to check the health of the system. Any activity that improves fault detection in a system improves availability. A crucial responsibility of this activity is latent fault detection. It is important to test portions of a system that are unused so that they are available when needed.
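Latent fault detection can be sketched as a scheduler that exercises only the idle parts of the system. The `DiagnosticManager` class, the component names, and the lambda self-tests below are hypothetical stand-ins for real diagnostics.

```python
class DiagnosticManager:
    """Hypothetical manager that runs self-tests on components not in use."""

    def __init__(self):
        self._checks = []  # (component name, in_use flag, test callable)

    def add_check(self, component, in_use, test):
        self._checks.append((component, in_use, test))

    def find_latent_faults(self):
        """Exercise idle components and return those whose self-test fails."""
        return [name for name, in_use, test in self._checks
                if not in_use and not test()]


diagnostics = DiagnosticManager()
diagnostics.add_check("active-link", True, lambda: True)
diagnostics.add_check("standby-link", False, lambda: False)  # silently broken
diagnostics.add_check("spare-disk", False, lambda: True)

latent = diagnostics.find_latent_faults()
```

Without such a check, the broken standby link in this example would go unnoticed until the moment a failover needed it.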

System Management

Finally, any system is usually part of a larger system. An interface for system management, such as Simple Network Management Protocol (SNMP) or Common Management Information Protocol (CMIP), is essential in any highly available system (see Figure 4). This service ties into the topology management of the system to provide status to the network agent and to allow remote access to control the system configuration.


Figure 4. High-Availability Software Services

A well-designed set of system services can relieve an application of much of the work it must do to be highly available. A stable platform for system implementation makes high degrees of availability achievable. High availability requires careful design at all levels of system implementation.
