International Engineering Consortium
Web ProForums
Highly Available Embedded Computer Platforms Become Reality

1. Fault Tolerance versus High Availability
Fault tolerance and high availability are not quite the same thing. Fault tolerance encompasses two properties: transactional reliability and high system availability. Many systems today use fault-tolerant computers when the application only needs availability. Transactional reliability is needed in banking applications or in billing records for a telecommunications network. Availability is needed in file server applications or in call-processing applications. Transactional reliability is not needed for any application where some loss of data is tolerable or where data transfer is protected by a reliable end-to-end protocol such as transmission control protocol (TCP)/Internet protocol (IP).

Traditional implementations of fault-tolerant platforms often involve proprietary hardware and software. This causes higher costs and longer design cycles—two things that may not be acceptable in emerging, competitive markets such as telecommunications. The challenge is to provide highly available platforms without resorting to extreme or expensive measures.

Availability is often expressed as a percentage. A system that must run 24 hours a day, 365 days a year with 99.9 percent availability has an average downtime of 8.76 hours (about 526 minutes) per year. A system allowed only about 5 minutes of service outage per year must have 99.999 percent availability.
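The conversion from an availability percentage to yearly downtime can be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the function name `annual_downtime_minutes` and the list of sample percentages are assumptions made for the example, not part of any standard.

```python
# Hours in a year for a system expected to be in service around the clock.
HOURS_PER_YEAR = 365 * 24  # 8,760


def annual_downtime_minutes(availability_percent: float) -> float:
    """Average downtime per year, in minutes, implied by an availability figure."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR * 60.0


if __name__ == "__main__":
    # Sample availability levels, from two 9s to five 9s (illustrative values).
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct} percent -> {annual_downtime_minutes(pct):.2f} minutes per year")
```

Each additional 9 cuts the permitted downtime by a factor of ten, which is why the step from three 9s to five 9s is so demanding.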

Availability is calculated using statistical models for all the system components. The simplest model for a component is binary: the component is either in or out of service. Availability can then be calculated from the mean time between failures (MTBF), which is the inverse of the failure rate, and the mean time to repair (MTTR). The average downtime contributed by a component is found by amortizing the MTTR over the MTBF period. For example, if a component critical to the operation of the platform has an MTBF of 250,000 hours and an MTTR of one hour, it contributes about 2.1 minutes of downtime to the system per year (60 minutes × 8,760 hours per year ÷ 250,000 hours). A more complex model for the component, which considers partial outages, requires more elaborate statistical methods to calculate availability.
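The binary model lends itself to a short calculation. The sketch below amortizes the MTTR over the MTBF period, as described above, and also shows the commonly used steady-state form MTBF / (MTBF + MTTR). The function names are hypothetical, and the only input taken from the text is the 250,000-hour, one-hour example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def downtime_minutes_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Average yearly downtime contributed by one component (binary model)."""
    expected_failures_per_year = HOURS_PER_YEAR / mtbf_hours
    return expected_failures_per_year * mttr_hours * 60.0


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the component is in service: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # The example from the text: MTBF of 250,000 hours, MTTR of one hour.
    minutes = downtime_minutes_per_year(250_000, 1)
    availability = steady_state_availability(250_000, 1)
    print(f"{minutes:.2f} minutes per year, availability {availability:.6%}")
```

When several such components are all critical to the platform, their individual downtime contributions can be added to give a first estimate for the whole system, provided failures are rare enough that overlapping outages can be ignored.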

Availability in the two 9s or three 9s range (99 percent to 99.9 percent) can be achieved by maximizing the reliability of components and minimizing repair times. To achieve higher availability, or to compensate for less reliable components, redundancy is used. A backup for a component that fails keeps the system operating. The availability of a redundant configuration depends on the time needed to detect the failure and switch over to the redundant component, so fault management becomes a critical factor in system design.
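The effect of redundancy can be illustrated with a deliberately simplified model. It assumes that every failure is detected, that the backup is always healthy when needed, and that the only outage per failure is the detection-plus-switchover interval. The function name and the 30-second switchover time are hypothetical values chosen for the example; only the 250,000-hour MTBF and one-hour MTTR come from the earlier example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def redundant_downtime_minutes_per_year(mtbf_hours: float,
                                        switchover_seconds: float) -> float:
    """Yearly downtime when each failure costs only the switchover interval.

    Simplifying assumptions: perfect fault detection, and a backup that
    never fails while the primary is being repaired.
    """
    expected_failures_per_year = HOURS_PER_YEAR / mtbf_hours
    return expected_failures_per_year * switchover_seconds / 60.0


if __name__ == "__main__":
    mtbf = 250_000  # hours, as in the earlier example
    # Without redundancy, each failure costs the full one-hour MTTR.
    simplex = (HOURS_PER_YEAR / mtbf) * 1.0 * 60.0
    # With redundancy, each failure costs a hypothetical 30-second switchover.
    duplex = redundant_downtime_minutes_per_year(mtbf, switchover_seconds=30)
    print(f"single component: {simplex:.4f} minutes per year")
    print(f"with switchover:  {duplex:.4f} minutes per year")
```

Under these assumptions the downtime scales directly with the switchover interval, which is why the speed and coverage of fault detection dominate the design of a redundant platform.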
