International Engineering Consortium
Web ProForums
Carrier-Grade, High-Availability Computing Platforms for Voice and Data Networks

1. Ensuring Continuous Availability of Communication Services

The Need for Software Fault-Tolerance

Services in today's communications networks must be highly robust. With the nearest competitor only one or two mouse clicks away, unplanned service downtime causes revenue loss and, in some cases, contractual penalties.

The computer industry has historically been concerned about hardware failures. Today, however, hardware reliability is very high, and the emphasis is shifting to software. Hardware reliability has increased because of process improvements in manufacturing and testing, and through redundancy in the components most likely to fail. Redundant arrays of inexpensive disks (RAID), redundant fans, and redundant power supplies are examples of design improvements in today’s midrange computer systems.

As software grows more complex, concern about the robustness of system software components grows with it. Even when exceptional caution is taken, simple logic errors can result in catastrophic failures. The midair explosion of the first Ariane 5 rocket and the massive failure of the North American Signaling System 7 (SS7) network are recent examples.

Hardware fault tolerance does not protect against software failure; indeed, it may even replicate the failure. Thus, protection from software failure becomes the central challenge. It is practically impossible to prevent software failures completely, and design approaches that virtually eliminate software failures (typically used only when human life is at risk) can more than triple development costs and still not guarantee that failures will not occur.

This implies that solutions must be designed to tolerate hardware failures (e.g., through simple replication of hardware resources) while using software architectures that reduce the risk of service outage as a result of software failures.

Benefits of Software Replication

To achieve hardware fault tolerance, it is necessary to replicate all critical hardware components. Not surprisingly, the same principle applies to software fault tolerance. To protect a software application from software faults, the application’s logic and data must be replicated. Typically, the logic and data are distributed across a cluster of computer systems so that the application can tolerate any single hardware or software fault within the cluster.

As well as providing a higher level of service availability, software replication can also bring other benefits. It can be used to address scalability requirements, through load-sharing (n+1) architectures. It can be used as a basis for on-line upgrade procedures, at both the platform (hardware + operating system) and application level. It allows the use of standard off-the-shelf computer platforms, enabling service providers to benefit from the latest hardware technology advances and increased processor speeds. This typically ensures a better price-to-performance ratio when compared with a hardware fault-tolerance approach.
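
The load-sharing (n+1) idea above can be made concrete with a little arithmetic: provision the n nodes needed to carry the peak load, plus one spare, so that the cluster still carries the full load after any single node fails. The following is a minimal sketch under that assumption. The function names, the integer load units, and the figures in the example are hypothetical and do not come from any particular platform.

```python
def nodes_required(peak_load, per_node_capacity, spares=1):
    """Return the node count for an n+spares load-sharing cluster.

    peak_load and per_node_capacity are integers in the same (assumed)
    unit, for example calls per second.
    """
    if peak_load < 0 or per_node_capacity <= 0:
        raise ValueError("load must be >= 0 and capacity must be > 0")
    # Integer ceiling division: the n nodes needed to carry the peak load.
    n = -(-peak_load // per_node_capacity)
    return n + spares


def survives_failures(nodes, peak_load, per_node_capacity, failures=1):
    """True if the remaining nodes can still carry the peak load."""
    return (nodes - failures) * per_node_capacity >= peak_load


if __name__ == "__main__":
    # Hypothetical figures: 1000 calls/s at peak, 250 calls/s per node.
    total = nodes_required(1000, 250)
    print(total)                                # 5 (n = 4, plus 1 spare)
    print(survives_failures(total, 1000, 250))  # True
    print(survives_failures(4, 1000, 250))      # False: no spare capacity
```

The same calculation also suggests why n+1 clusters lend themselves to on-line upgrades: one node at a time can be taken out of service without reducing capacity below the peak load.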

Of course, software replication technology can also be used in conjunction with hardware fault-tolerant systems; the use of one does not exclude the other.

However, the cost of developing software in a way that guarantees high availability should not be underestimated. The mechanisms required to replicate application logic and data can be complex and error-prone, leading to less stable solutions. The challenge facing the computer industry is to develop generic software replication frameworks that can be reused by a wide range of applications.

A generic software replication framework must address the following issues:

  • detecting software faults
  • ensuring rapid application-level recovery from failures
  • providing a persistent store for application state information, so that the state survives hardware and software faults
  • providing a single-system view of the replicated software, in particular on the management and provisioning interfaces
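
To illustrate how these four concerns fit together, here is a deliberately small sketch, assuming a primary/standby pair, heartbeat-based fault detection, and state that is copied to both replicas on every write. All class and method names are hypothetical; a real framework would add network transport, durable storage, and protection against split-brain conditions, none of which are shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the application logic together with its copy of the state."""
    name: str
    state: dict = field(default_factory=dict)
    last_heartbeat: float = 0.0


class ReplicatedService:
    """Hypothetical primary/standby pair presented as a single service."""

    def __init__(self, primary, standby, timeout):
        self.primary = primary
        self.standby = standby
        self.timeout = timeout  # seconds of silence before a replica is presumed failed

    def heartbeat(self, replica, now):
        """Record that a replica reported itself alive at time `now`."""
        replica.last_heartbeat = now

    def _alive(self, replica, now):
        return now - replica.last_heartbeat <= self.timeout

    def write(self, key, value):
        """State store: apply every update to both replicas."""
        self.primary.state[key] = value
        self.standby.state[key] = value

    def read(self, key):
        """Single-system view: callers never name a replica."""
        return self.primary.state[key]

    def check(self, now):
        """Fault detection and recovery: promote the standby if the primary
        has been silent for too long and the standby is still alive."""
        if not self._alive(self.primary, now) and self._alive(self.standby, now):
            self.primary, self.standby = self.standby, self.primary
            return True
        return False


if __name__ == "__main__":
    svc = ReplicatedService(Replica("node-a"), Replica("node-b"), timeout=3.0)
    svc.heartbeat(svc.primary, now=0.0)
    svc.heartbeat(svc.standby, now=0.0)
    svc.write("call-42", "connected")

    svc.heartbeat(svc.standby, now=4.0)  # node-a has gone silent
    print(svc.check(now=5.0))            # True: node-b is promoted
    print(svc.primary.name)              # node-b
    print(svc.read("call-42"))           # connected
```

Time is passed in explicitly rather than read from the system clock, which keeps the sketch deterministic and easy to test. Because callers use only `read` and `write`, the failover is invisible to them, which is the point of the single-system view.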
