- Computer Failure Classifications, Hardware and Software Error Sources, and Computer Reliability Measures
- Computer Hardware Reliability versus Software Reliability
- Fault Masking
- Triple Modular Redundancy (TMR)
- TMR System Maximum Reliability with Perfect Voter
- TMR System with Voter Time-Dependent Reliability and Mean Time to Failure
- N-Modular Redundancy (NMR)

# Computer Failure Classifications, Hardware and Software Error Sources, and Computer Reliability Measures

Computer-related failures may be categorized under the following five classifications [15]:

- **Classification I: Hardware failures.** These failures are just like those in any other piece of equipment, and they occur due to factors such as poor maintenance, unexpected environmental conditions, poor design, and defective parts.
- **Classification II: Software failures.** These failures result from the inability of programs to continue processing due to erroneous logic.
- **Classification III: Specifications failures.** These failures are distinguished by their origin, i.e., defects in the system’s specification, rather than in the design or execution of either software or hardware.
- **Classification IV: Malicious failures.** These failures are due to a relatively new phenomenon, i.e., the malicious introduction of programs intended to cause damage to anonymous users. Often these programs are called computer viruses.
- **Classification V: Human errors.** These errors take place due to wrong actions or lack of actions by humans involved in the process (e.g., the system’s operators, builders, and designers).

There are many sources of hardware and software errors. Some of these sources are inherited errors, data preparation errors, handwriting errors, keying errors, and optical character reader errors. In a computer-based system, the inherited errors can account for over 50% of the errors [16]. Furthermore, data preparation-associated tasks can also generate a significant proportion of errors. As per Bailey [16], at least 40% of all errors come from manipulating the data (i.e., data preparation) prior to writing it down or entering it into the involved computer system.

Additional information on computer failure classifications and hardware and software error sources is available in Refs. [15,16].

There are many measures used in the area of computer system reliability. They may be grouped under the following two categories [14,17]:

- **Category I:** This category contains four measures that are suitable for configurations such as standby, hybrid, and massively redundant systems. The measures are mean time to failure, system reliability, system availability, and mission time. It is to be noted that for evaluating gracefully degrading systems, these measures may not be sufficient.
- **Category II:** This category contains the following five new measures for handling gracefully degrading systems:
  - **Measure I: Mean computation before failure:** This is the expected amount of computation available on the system prior to failure.
  - **Measure II: Computation reliability:** This is the probability that the system will execute, without an error, a task of length *x* started at time *t*.
  - **Measure III: Computation availability:** This is the expected computation capacity of the system at a given time *t*.
  - **Measure IV: Capacity threshold:** This is the time at which a certain value of computation availability is reached.
  - **Measure V: Computation threshold:** This is the time at which a certain value of computation reliability is reached for a task of length *x*.

# Computer Hardware Reliability versus Software Reliability

As it is very important to have a clear comprehension of the differences between hardware and software reliability, a number of comparisons of important areas are presented in Table 6.1 [12,18,19].

TABLE 6.1 Hardware and software reliability comparisons

| No. | Hardware Reliability | Software Reliability |
|---|---|---|
| 1 | Wears out | Does not wear out |
| 2 | Mean time to repair (MTTR) has significance | Mean time to repair (MTTR) has no significance |
| 3 | A hardware failure is generally due to physical effects | Software failure is caused by programming error |
| 4 | It is quite possible to repair hardware by using spare modules | It is impossible to repair software failures by using spare modules |
| 5 | The hardware reliability field is quite well developed, particularly in regard to electronics | The software reliability field is relatively new |
| 6 | Obtaining good failure-associated data is a problem | Obtaining good failure-associated data is a problem |
| 7 | Hardware reliability has well-developed theory and mathematical concepts | Software reliability still lacks well-developed theory and mathematical concepts |
| 8 | Generally redundancy is effective | Redundancy may not be effective |
| 9 | Preventive maintenance is conducted to inhibit failures | Preventive maintenance has no meaning in software |
| 10 | Many hardware items fail as per the bathtub hazard rate curve | Software does not fail as per the bathtub hazard rate curve |
| 11 | The failed item/system is repaired by conducting corrective maintenance | Corrective maintenance is basically redesign |
| 12 | Interfaces are visual | Interfaces are conceptual |

# Fault Masking

The term fault masking is used in the area of fault-tolerant computing, in the sense that a system with redundancy can tolerate a number of failures/malfunctions prior to its own failure. More clearly, the implication of the term is that some kind of problem has surfaced somewhere within the framework of a digital system, but because of design, the problem does not affect the overall operation of the system under consideration.

The best known fault masking method is probably modular redundancy and is presented in the following sections [12].

## Triple Modular Redundancy (TMR)

In this case, three identical modules/units perform the same task simultaneously and the voter compares the outputs of the modules/units and sides with the majority [12,20]. More clearly, the TMR system fails only when more than one module/unit fails or the voter fails. In other words, the TMR system can tolerate failure of a single module/unit. An important example of the TMR system’s application is the Saturn V launch vehicle computer [12,20]. The vehicle computer used TMR with voters in the central processor and duplication in the main memory [12,21].

The block diagram of the TMR scheme is shown in Figure 6.1; the blocks in the diagram denote the modules/units and the circle denotes the voter.
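The majority-voting behavior described above can be illustrated with a minimal Python sketch. The function name `majority_vote` and the choice to raise an exception when no majority exists are assumptions made for illustration; they are not taken from the cited references.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the modules.

    Raises ValueError when no value has a strict majority (hypothetical
    choice for this sketch; a real voter might signal failure differently).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


# The second module is faulty; its output is masked by the other two.
print(majority_vote([42, 17, 42]))  # 42
```

Note that such a voter can only mask a single faulty module out of three; if two modules fail and happen to agree on the same wrong value, the voter has no way to detect it.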

For independently failing modules/units and the voter, the reliability of the system in Figure 6.1 is given by [12]

$$R_{tmv} = \left(3R^2 - 2R^3\right) R_v \tag{6.1}$$

where

$R_{tmv}$ is the reliability of the TMR system with voter.

$R$ is the reliability of the module/unit.

$R_v$ is the reliability of the voter.

FIGURE 6.1 Block diagram for TMR system with voter.

With a perfect voter (i.e., 100% reliable), Equation (6.1) becomes

$$R_{tmp} = 3R^2 - 2R^3 \tag{6.2}$$

where

$R_{tmp}$ is the reliability of the TMR system with perfect voter.

It is to be noted that the voter reliability and the single unit’s reliability determine the improvement in reliability of the TMR system over a single unit system. For the perfect voter (i.e., $R_v = 1$), the TMR system reliability given by Equation (6.2) is better than that of the single unit system only when the reliability of the single unit is greater than 0.5.

At $R_v = 0.8$, the TMR system’s reliability is always less than the single unit’s reliability. Furthermore, when the voter reliability is 0.9 (i.e., $R_v = 0.9$), the TMR system’s reliability is only marginally better than the single unit/module reliability, and only when the single unit/module reliability is approximately between 0.667 and 0.833 [22].
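The voter-reliability thresholds quoted above can be checked with a short Python sketch. It assumes the standard TMR form $R_v(3R^2 - 2R^3)$ for Equation (6.1); the function names are hypothetical. The TMR system beats a single unit when $R_v(3R - 2R^2) > 1$, i.e., when $2R^2 - 3R + 1/R_v < 0$, which holds between the two roots of that quadratic whenever they are real.

```python
import math
from typing import Optional, Tuple


def tmr_reliability(r: float, r_v: float = 1.0) -> float:
    """TMR reliability with voter reliability r_v (assumed form of Eq. 6.1)."""
    return (3 * r**2 - 2 * r**3) * r_v


def tmr_advantage_interval(r_v: float) -> Optional[Tuple[float, float]]:
    """Range of module reliability R over which TMR beats a single unit.

    Solves 2R^2 - 3R + 1/r_v = 0; returns None when there are no two
    distinct real roots, i.e. TMR never beats the single unit.
    """
    disc = 9.0 - 8.0 / r_v
    if disc <= 0:
        return None
    root = math.sqrt(disc)
    return ((3 - root) / 4, (3 + root) / 4)


for r_v in (1.0, 0.9, 0.8):
    print(r_v, tmr_advantage_interval(r_v))
```

For $R_v = 1$ the interval runs from 0.5 to 1, for $R_v = 0.9$ it is roughly 0.667 to 0.833, and for $R_v = 0.8$ no interval exists, in line with the statements in the text.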

### TMR System Maximum Reliability with Perfect Voter

For perfect voter, the TMR system reliability is expressed by Equation (6.2). Under this scenario, the ratio of $R_{tmp}$ to a single unit reliability, $R$, is given by [23]

$$\gamma = \frac{R_{tmp}}{R} = \frac{3R^2 - 2R^3}{R} = 3R - 2R^2 \tag{6.3}$$

By differentiating Equation (6.3) with respect to $R$ and equating it to zero, we get

$$\frac{d\gamma}{dR} = 3 - 4R = 0 \tag{6.4}$$

Thus, from Equation (6.4), we obtain $R = 0.75$. This simply means that the maximum value of the reliability improvement ratio, $\gamma$, and the corresponding reliability of the TMR system, $R_{tmp}$, are respectively:

$$\gamma = 3(0.75) - 2(0.75)^2 = 1.125$$

and

$$R_{tmp} = 3(0.75)^2 - 2(0.75)^3 = 0.8438$$
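The location of the maximum can also be confirmed numerically. The following Python sketch assumes the ratio takes the form $3R - 2R^2$ (the perfect-voter TMR reliability divided by $R$) and uses a simple grid search; the grid resolution is an arbitrary choice for illustration.

```python
def improvement_ratio(r: float) -> float:
    """Ratio of perfect-voter TMR reliability to single-unit reliability."""
    return 3 * r - 2 * r**2


# Coarse grid search over (0, 1]; the step of 0.0001 is an arbitrary choice.
grid = [i / 10_000 for i in range(1, 10_001)]
best_r = max(grid, key=improvement_ratio)

print(best_r, improvement_ratio(best_r))   # 0.75 1.125
print(best_r * improvement_ratio(best_r))  # TMR reliability at that point
```

The last line recovers the TMR reliability at $R = 0.75$ as the product of the ratio and $R$, which is 0.84375.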

**Example 6.1**

Assume that a TMR system’s reliability with a perfect voter is expressed by Equation (6.2). Determine the points where the single-unit and the TMR-system reliabilities are equal.

To determine the points, we equate a single unit’s reliability with Equation (6.2) to obtain

$$R = 3R^2 - 2R^3 \tag{6.5}$$

By dividing Equation (6.5) by $R$ (for $R > 0$) and rearranging, we get

$$2R^2 - 3R + 1 = 0 \tag{6.6}$$

Equation (6.6) is a quadratic equation and its roots are

$$R = \frac{3 + \sqrt{9 - 4(2)(1)}}{2(2)} = 1 \tag{6.7}$$

and

$$R = \frac{3 - \sqrt{9 - 4(2)(1)}}{2(2)} = \frac{1}{2} \tag{6.8}$$

This means the reliabilities of the TMR system with perfect voter and the single unit are equal at $R = 1/2$ or $R = 1$. Furthermore, the reliability of the TMR system with perfect voter will only be greater than the single unit’s reliability when the value of $R$ is higher than 0.5.

### TMR System with Voter Time-Dependent Reliability and Mean Time to Failure

With the aid of material presented in Chapter 3 and Equation (6.1), for constant failure rates of the TMR system units and the voter unit, the TMR system with voter reliability is expressed by [12,24]

$$R_{tmv}(t) = \left[3e^{-2\lambda t} - 2e^{-3\lambda t}\right] e^{-\lambda_{vr} t} \tag{6.9}$$

where

$R_{tmv}(t)$ is the TMR system with voter reliability at time $t$.

$\lambda$ is the unit/module constant failure rate.

$\lambda_{vr}$ is the voter unit constant failure rate.

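By integrating Equation (6.9) over the time interval from 0 to $\infty$, we get the following equation for the TMR system with voter mean time to failure [12,14]:

$$MTTF_{tmv} = \int_0^{\infty} \left[3e^{-(2\lambda + \lambda_{vr})t} - 2e^{-(3\lambda + \lambda_{vr})t}\right] dt = \frac{3}{2\lambda + \lambda_{vr}} - \frac{2}{3\lambda + \lambda_{vr}} \tag{6.10}$$

where

$MTTF_{tmv}$ is the mean time to failure of the TMR system with voter.

For perfect voter (i.e., $\lambda_{vr} = 0$), Equation (6.10) reduces to

$$MTTF_{tmp} = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda} \tag{6.11}$$

where

$MTTF_{tmp}$ is the TMR system with perfect voter mean time to failure.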

**Example 6.2**

Assume that the constant failure rate of a unit/module belonging to a TMR system with voter is $\lambda = 0.0004$ failures per hour. Calculate the system reliability for a 500-hour mission if the voter unit constant failure rate is $\lambda_{vr} = 0.0002$ failures per hour. In addition, calculate the TMR system mean time to failure.

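By substituting the specified data values into Equation (6.9), we get

$$R_{tmv}(500) = \left[3e^{-2(0.0004)(500)} - 2e^{-3(0.0004)(500)}\right] e^{-(0.0002)(500)} = 0.8264$$

Similarly, by inserting the specified data values into Equation (6.10), we get

$$MTTF_{tmv} = \frac{3}{2(0.0004) + 0.0002} - \frac{2}{3(0.0004) + 0.0002} = 1571.43 \text{ hours}$$

Thus, the TMR system with voter reliability and mean time to failure are 0.8264 and 1571.43 hours, respectively.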
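The arithmetic of Example 6.2 can be reproduced with a short Python sketch. It assumes the exponential form $[3e^{-2\lambda t} - 2e^{-3\lambda t}]e^{-\lambda_{vr} t}$ for Equation (6.9) and its closed-form integral for Equation (6.10). The function names, the integration horizon, and the step count are hypothetical choices made for illustration.

```python
import math


def tmr_reliability_t(t: float, lam: float, lam_vr: float) -> float:
    """TMR-with-voter reliability at time t for constant failure rates."""
    modules = 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)
    return modules * math.exp(-lam_vr * t)


def tmr_mttf(lam: float, lam_vr: float) -> float:
    """Closed-form integral of tmr_reliability_t over [0, infinity)."""
    return 3 / (2 * lam + lam_vr) - 2 / (3 * lam + lam_vr)


def tmr_mttf_numeric(lam: float, lam_vr: float,
                     horizon: float = 100_000.0, steps: int = 200_000) -> float:
    """Trapezoidal check of the MTTF; horizon and steps are arbitrary choices."""
    h = horizon / steps
    total = 0.5 * (tmr_reliability_t(0.0, lam, lam_vr)
                   + tmr_reliability_t(horizon, lam, lam_vr))
    for k in range(1, steps):
        total += tmr_reliability_t(k * h, lam, lam_vr)
    return total * h


lam, lam_vr = 0.0004, 0.0002  # failures per hour, from Example 6.2
print(round(tmr_reliability_t(500, lam, lam_vr), 4))  # 0.8264
print(round(tmr_mttf(lam, lam_vr), 2))                # 1571.43
print(round(tmr_mttf_numeric(lam, lam_vr), 2))        # numerical cross-check
```

The exact mean time to failure is 1571.4286 hours, so the last reported digit depends on whether the value is rounded or truncated. The numerical integral serves only as a cross-check on the closed form.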

## N-Modular Redundancy (NMR)

This is the general form of the TMR (i.e., it contains *N* identical modules/units instead of only three units).

The number *N* is any odd number, and the NMR system can tolerate a maximum of *n* module/unit failures if the value of *N* is equal to (2*n* + 1). As the voter acts in series with the *N*-module system, the complete system malfunctions whenever a voter unit failure occurs.

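The reliability of the NMR system with independent modules/units is given by [12,25]

$$R_{nmv} = R_v \sum_{i=0}^{n} \binom{N}{i} R^{N-i} (1 - R)^{i}, \qquad \binom{N}{i} = \frac{N!}{(N-i)!\, i!}$$

where

$R_{nmv}$ is the reliability of the NMR system with voter.

$R_v$ is the voter reliability.

$R$ is the module/unit reliability.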
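As a minimal sketch, the NMR reliability can be computed as the voter reliability multiplied by the probability that at most *n* of the *N* = 2*n* + 1 independent modules fail. The binomial-sum form used here is an assumption consistent with the TMR case (it reduces to $3R^2 - 2R^3$ for *N* = 3), and the function name is hypothetical.

```python
from math import comb


def nmr_reliability(n_modules: int, r: float, r_v: float = 1.0) -> float:
    """Reliability of an N-modular redundant system with a voter in series.

    n_modules must be odd (N = 2n + 1); the system survives when at most
    n modules fail and the voter works.
    """
    if n_modules < 1 or n_modules % 2 == 0:
        raise ValueError("N must be a positive odd integer")
    n = (n_modules - 1) // 2  # maximum number of tolerated module failures
    working = sum(
        comb(n_modules, i) * r ** (n_modules - i) * (1 - r) ** i
        for i in range(n + 1)
    )
    return r_v * working


# N = 3 reproduces the TMR result 3R^2 - 2R^3 (perfect voter assumed here).
print(nmr_reliability(3, 0.9))
print(nmr_reliability(5, 0.9))
```

With $R = 0.9$ and a perfect voter, moving from three to five modules raises the computed reliability from about 0.972 to about 0.991.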

Finally, it is added that the time-dependent reliability analysis of an NMR system can be performed in a manner similar to the TMR system reliability analysis. Additional information on redundancy schemes is available in Nerber [26].