Computer Failure Classifications, Hardware and Software Error Sources, and Computer Reliability Measures
Computer-related failures may be categorized under the following five classifications [15]:
- Classification I: Hardware failures. These failures are like those of any other piece of equipment, and they occur due to factors such as poor maintenance, unexpected environmental conditions, poor design, and defective parts.
- Classification II: Software failures. These failures result from the inability of programs to continue processing because of erroneous logic.
- Classification III: Specification failures. These failures are distinguished by their origin, i.e., defects in the system's specification rather than in the design or execution of either software or hardware.
- Classification IV: Malicious failures. These failures are due to a relatively new phenomenon, i.e., the malicious introduction of programs intended to cause damage to anonymous users. Such programs are often called computer viruses.
- Classification V: Human errors. These errors occur due to wrong actions, or the lack of actions, by humans involved in the process (e.g., the system's operators, builders, and designers).
Hardware and software errors arise from many sources, including inherited errors, data preparation errors, handwriting errors, keying errors, and optical character reader errors. In a computer-based system, inherited errors can account for over 50% of the errors [16]. Data preparation tasks can also generate a significant proportion of errors: according to Bailey [16], at least 40% of all errors come from manipulating the data (i.e., data preparation) prior to writing it down or entering it into the computer system.
Additional information on computer failure classifications and hardware and software error sources is available in Refs. [15,16].
There are many measures used in the area of computer system reliability. They may be grouped under the following two categories [14,17]:
- Category I: This category contains four measures that are suitable for configurations such as standby, hybrid, and massively redundant systems: mean time to failure, system reliability, system availability, and mission time. Note that these measures may not be sufficient for evaluating gracefully degrading systems.
- Category II: This category contains the following five new measures for handling gracefully degrading systems:
  - Measure I: Mean computation before failure. This is the expected amount of computation available on the system prior to failure.
  - Measure II: Computation reliability. This is the probability that the system will execute, without an error, a task of length x started at time t.
  - Measure III: Computation availability. This is the expected computation capacity of the system at a given time t.
  - Measure IV: Capacity threshold. This is the time at which a certain value of computation availability is reached.
  - Measure V: Computation threshold. This is the time at which a certain value of computation reliability is reached for a task of length x.
Computer Hardware Reliability versus Software Reliability
Because a clear understanding of the differences between hardware and software reliability is important, Table 6.1 compares the two in a number of key areas [12,18,19].
TABLE 6.1
Hardware and software reliability comparisons
| No. | Hardware Reliability | Software Reliability |
|---|---|---|
| 1 | Wears out | Does not wear out |
| 2 | Mean time to repair (MTTR) has significance | Mean time to repair (MTTR) has no significance |
| 3 | A hardware failure is generally due to physical effects | Software failure is caused by programming error |
| 4 | It is quite possible to repair hardware by using spare modules | It is impossible to repair software failures by using spare modules |
| 5 | The hardware reliability field is quite well developed, particularly in regard to electronics | The software reliability field is relatively new |
| 6 | Obtaining good failure-associated data is a problem | Obtaining good failure-associated data is a problem |
| 7 | Hardware reliability has well-developed theory and mathematical concepts | Software reliability still lacks well-developed theory and mathematical concepts |
| 8 | Generally redundancy is effective | Redundancy may not be effective |
| 9 | Preventive maintenance is conducted to inhibit failures | Preventive maintenance has no meaning in software |
| 10 | Many hardware items fail as per the bathtub hazard rate curve | Software does not fail as per the bathtub hazard rate curve |
| 11 | The failed item/system is repaired by conducting corrective maintenance | Corrective maintenance is basically redesign |
| 12 | Interfaces are visual | Interfaces are conceptual |
Fault Masking
The term fault masking is used in fault-tolerant computing in the sense that a system with redundancy can tolerate a number of failures/malfunctions before it fails itself. More precisely, the term implies that a problem has surfaced somewhere within a digital system but, by design, does not affect the overall operation of the system under consideration.
Probably the best-known fault masking method is modular redundancy, which is presented in the following sections [12].
Triple Modular Redundancy (TMR)
In this case, three identical modules/units perform the same task simultaneously, and a voter compares their outputs and sides with the majority [12,20]. More precisely, the TMR system fails only when more than one module/unit fails or the voter fails. In other words, the TMR system can tolerate the failure of a single module/unit. An important example of a TMR application is the Saturn V launch vehicle computer [12,20], which used TMR with voters in the central processor and duplication in the main memory [12,21].
The block diagram of the TMR scheme is shown in Figure 6.1; the blocks in the diagram denote the modules/units and the circle denotes the voter.
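The voting step can be illustrated with a minimal software sketch. It assumes that module outputs are plain comparable values and that the voter itself never fails; the function name `majority_vote` is hypothetical and is not taken from the cited references.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value reported by a strict majority of the modules.

    Raises ValueError when no value has a strict majority, which for
    three modules means that all three outputs disagree.
    """
    if not outputs:
        raise ValueError("no module outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


# The faulty second module is masked by the two agreeing modules.
print(majority_vote([42, 17, 42]))  # 42
```

With three modules, a single wrong output is outvoted, whereas two or more disagreeing outputs leave the voter without a majority, which corresponds to the system failure condition described above.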
For independently failing modules/units and voter, the reliability of the system in Figure 6.1 is given by [12]

$$R_{tmv} = (3R^2 - 2R^3) R_v \tag{6.1}$$

where
$R_{tmv}$ is the reliability of the TMR system with voter.
$R$ is the reliability of the module/unit.
$R_v$ is the reliability of the voter.

FIGURE 6.1 Block diagram for TMR system with voter.
With a perfect voter (i.e., 100% reliable), Equation (6.1) becomes

$$R_{tmp} = 3R^2 - 2R^3 \tag{6.2}$$

where
$R_{tmp}$ is the reliability of the TMR system with perfect voter.
Note that the voter reliability and the single unit's reliability together determine the improvement in reliability of the TMR system over a single unit system. For the perfect voter (i.e., $R_v = 1$), the TMR system reliability given by Equation (6.2) is better than that of the single unit system only when the reliability of the single unit is greater than 0.5.
At $R_v = 0.8$, the TMR system's reliability is always less than the single unit's reliability. Furthermore, when the voter reliability is 0.9 (i.e., $R_v = 0.9$), the TMR system's reliability is only marginally better than the single unit/module reliability, and only when the single unit/module reliability is approximately between 0.667 and 0.833 [22].
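These break-even ranges can be checked numerically. The sketch below assumes the TMR reliability has the form $(3R^2 - 2R^3) R_v$ for independent failures and scans a uniform grid of module reliabilities; the function names and the grid resolution are illustrative choices, not part of the cited analysis.

```python
def tmr_reliability(r: float, r_voter: float = 1.0) -> float:
    """TMR system reliability (3R^2 - 2R^3) * Rv, assuming independent failures."""
    return (3 * r**2 - 2 * r**3) * r_voter


def improvement_interval(r_voter: float, steps: int = 10_000):
    """Return (low, high) of the grid points where TMR beats a single module,
    or None when TMR is never better on the grid."""
    grid = [i / steps for i in range(1, steps)]
    better = [r for r in grid if tmr_reliability(r, r_voter) > r]
    return (min(better), max(better)) if better else None


for rv in (1.0, 0.9, 0.8):
    print(rv, improvement_interval(rv))
```

For a voter reliability of 0.8 the scan finds no module reliability at which TMR helps, and for 0.9 the interval it reports lies close to the 0.667 to 0.833 range quoted above.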
TMR System Maximum Reliability with Perfect Voter
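For a perfect voter, the TMR system reliability is expressed by Equation (6.2). Under this scenario, the ratio of $R_{tmp}$ to a single unit's reliability, $R$, is given by [23]

$$\gamma = \frac{R_{tmp}}{R} = \frac{3R^2 - 2R^3}{R} = 3R - 2R^2 \tag{6.3}$$

By differentiating Equation (6.3) with respect to $R$ and equating it to zero, we get

$$\frac{d\gamma}{dR} = 3 - 4R = 0 \tag{6.4}$$

Thus, from Equation (6.4), we obtain $R = 0.75$. This means that the maximum values of the reliability improvement ratio, $\gamma$, and the reliability of the TMR system, $R_{tmp}$, are, respectively,

$$\gamma = 3(0.75) - 2(0.75)^2 = 1.125$$

and

$$R_{tmp} = 3(0.75)^2 - 2(0.75)^3 = 0.8438$$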
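As a numerical cross-check of this maximization, the following sketch evaluates the ratio $3R - 2R^2$ (the perfect-voter TMR reliability $3R^2 - 2R^3$ divided by $R$) on a uniform grid and picks the largest value. The grid search stands in for the calculus argument and is only an illustration.

```python
def improvement_ratio(r: float) -> float:
    """Ratio of perfect-voter TMR reliability to single-unit reliability."""
    return 3 * r - 2 * r**2


steps = 10_000
grid = [i / steps for i in range(1, steps + 1)]

# Grid point with the largest improvement ratio.
best_r = max(grid, key=improvement_ratio)
best_tmr = 3 * best_r**2 - 2 * best_r**3

print(best_r, improvement_ratio(best_r), best_tmr)  # 0.75 1.125 0.84375
```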
Example 6.1
Assume that the reliability of a TMR system with a perfect voter is expressed by Equation (6.2). Determine the points at which the single-unit and the TMR system reliabilities are equal.
To determine the points, we equate a single unit's reliability with Equation (6.2) to obtain

$$R = 3R^2 - 2R^3 \tag{6.5}$$

By rearranging Equation (6.5) and dividing out the trivial root $R = 0$, we get

$$2R^2 - 3R + 1 = 0 \tag{6.6}$$

Equation (6.6) is a quadratic equation and its roots are

$$R = \frac{3 + \sqrt{9 - 4(2)(1)}}{2(2)} = 1 \tag{6.7}$$

and

$$R = \frac{3 - \sqrt{9 - 4(2)(1)}}{2(2)} = \frac{1}{2} \tag{6.8}$$

This means that the reliabilities of the TMR system with perfect voter and the single unit are equal at $R = 1/2$ or $R = 1$. Furthermore, the reliability of the TMR system with perfect voter is greater than the single unit's reliability only when the value of $R$ is higher than 0.5.
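The roots in this example can be verified with a few lines of code. The sketch assumes the rearranged quadratic is $2R^2 - 3R + 1 = 0$ and substitutes each root back into $R = 3R^2 - 2R^3$; the helper `quadratic_roots` is a hypothetical name used only here.

```python
import math


def quadratic_roots(a: float, b: float, c: float) -> tuple:
    """Real roots of a*x^2 + b*x + c = 0, smaller root first."""
    disc = b * b - 4 * a * c
    if disc < 0:
        raise ValueError("no real roots")
    s = math.sqrt(disc)
    return ((-b - s) / (2 * a), (-b + s) / (2 * a))


roots = quadratic_roots(2.0, -3.0, 1.0)
print(roots)  # (0.5, 1.0)

# Each root must make the single-unit and TMR reliabilities equal.
for r in roots:
    assert math.isclose(3 * r**2 - 2 * r**3, r)
```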
TMR System with Voter Time-Dependent Reliability and Mean Time to Failure
With the aid of the material presented in Chapter 3 and Equation (6.1), for constant failure rates of the TMR system units and the voter unit, the reliability of the TMR system with voter is expressed by [12,24]

$$R_{tmv}(t) = \left[3e^{-2\lambda t} - 2e^{-3\lambda t}\right] e^{-\lambda_{vr} t} \tag{6.9}$$

where
$R_{tmv}(t)$ is the reliability of the TMR system with voter at time $t$.
$\lambda$ is the unit/module constant failure rate.
$\lambda_{vr}$ is the voter unit constant failure rate.
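By integrating Equation (6.9) over the time interval from 0 to $\infty$, we get the following equation for the mean time to failure of the TMR system with voter [12,14]:

$$MTTF_{tmv} = \int_0^{\infty} R_{tmv}(t)\, dt = \frac{3}{2\lambda + \lambda_{vr}} - \frac{2}{3\lambda + \lambda_{vr}} \tag{6.10}$$

where
$MTTF_{tmv}$ is the mean time to failure of the TMR system with voter.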
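For a perfect voter (i.e., $\lambda_{vr} = 0$), Equation (6.10) reduces to

$$MTTF_{tmp} = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda} \tag{6.11}$$

where
$MTTF_{tmp}$ is the mean time to failure of the TMR system with perfect voter.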
Example 6.2
Assume that the constant failure rate of a unit/module belonging to a TMR system with voter is $\lambda = 0.0004$ failures per hour. Calculate the system reliability for a 500-hour mission if the voter unit constant failure rate is $\lambda_{vr} = 0.0002$ failures per hour. In addition, calculate the TMR system mean time to failure.
By substituting the specified data values into Equation (6.9), we get

$$R_{tmv}(500) = \left[3e^{-2(0.0004)(500)} - 2e^{-3(0.0004)(500)}\right] e^{-(0.0002)(500)} = 0.8264$$

Similarly, by inserting the specified data values into Equation (6.10), we get

$$MTTF_{tmv} = \frac{3}{2(0.0004) + 0.0002} - \frac{2}{3(0.0004) + 0.0002} = 1571.43 \text{ hours}$$

Thus, the reliability and mean time to failure of the TMR system with voter are 0.8264 and 1571.43 hours, respectively.
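The arithmetic of this example can be reproduced with the short sketch below. It assumes the time-dependent reliability $[3e^{-2\lambda t} - 2e^{-3\lambda t}]e^{-\lambda_{vr} t}$ and the closed-form mean time to failure $3/(2\lambda + \lambda_{vr}) - 2/(3\lambda + \lambda_{vr})$, and it cross-checks the latter with a simple trapezoidal integration. The function names, the integration horizon, and the step count are illustrative assumptions.

```python
import math


def tmr_reliability_at(t: float, lam: float, lam_voter: float) -> float:
    """Reliability at time t of a TMR system with voter (constant failure rates)."""
    return (3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)) * math.exp(-lam_voter * t)


def tmr_mttf(lam: float, lam_voter: float) -> float:
    """Closed-form mean time to failure of a TMR system with voter."""
    return 3 / (2 * lam + lam_voter) - 2 / (3 * lam + lam_voter)


def trapezoid(f, upper: float, steps: int) -> float:
    """Trapezoidal approximation of the integral of f over [0, upper]."""
    h = upper / steps
    total = 0.5 * (f(0.0) + f(upper))
    total += sum(f(i * h) for i in range(1, steps))
    return total * h


lam, lam_voter = 0.0004, 0.0002  # failures per hour, from Example 6.2

reliability = tmr_reliability_at(500.0, lam, lam_voter)
mttf = tmr_mttf(lam, lam_voter)
# Numerical check: the integrand is negligible beyond 100,000 hours.
mttf_numeric = trapezoid(lambda t: tmr_reliability_at(t, lam, lam_voter), 100_000.0, 100_000)

print(round(reliability, 4), round(mttf, 2), round(mttf_numeric, 2))
```

The numerically integrated value agrees with the closed form to within the discretization error of the trapezoidal rule.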
N-Modular Redundancy (NMR)
This is the general form of TMR (i.e., it contains N identical modules/units instead of only three).
The number N is any odd number, and the NMR system can tolerate a maximum of n module/unit failures if N is equal to (2n + 1). As the voter acts in series with the N-module system, the complete system fails whenever the voter unit fails.
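The reliability of the NMR system with independent modules/units is given by [12,25]

$$R_{nmv} = \left[\sum_{i=0}^{n} \binom{N}{i} R^{N-i} (1 - R)^{i}\right] R_v \tag{6.12}$$

where
$R_{nmv}$ is the reliability of the NMR system with voter.
$R_v$ is the voter reliability.
$R$ is the module/unit reliability.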
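A minimal sketch of the NMR calculation follows. It assumes the binomial form in which the system survives when at most n of the N = 2n + 1 independent modules fail and the voter works; the function name `nmr_reliability` and the sample values are hypothetical.

```python
from math import comb, isclose


def nmr_reliability(n_modules: int, r: float, r_voter: float = 1.0) -> float:
    """Reliability of an N-modular redundant system with voter.

    n_modules must be odd (N = 2n + 1); up to n module failures are tolerated.
    """
    if n_modules < 1 or n_modules % 2 == 0:
        raise ValueError("n_modules must be a positive odd number")
    n = (n_modules - 1) // 2
    # Sum the probabilities of exactly i failed modules, for i = 0..n.
    surviving = sum(
        comb(n_modules, i) * r ** (n_modules - i) * (1 - r) ** i
        for i in range(n + 1)
    )
    return surviving * r_voter


# For N = 3 the expression collapses to the TMR form 3R^2 - 2R^3.
assert isclose(nmr_reliability(3, 0.9), 3 * 0.9**2 - 2 * 0.9**3)

for n_modules in (1, 3, 5, 7):
    print(n_modules, nmr_reliability(n_modules, 0.9))
```

With N = 1 the function returns the single-module reliability multiplied by the voter reliability, which gives a baseline for judging what the extra modules contribute.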
Finally, note that the time-dependent reliability analysis of an NMR system can be performed in a manner similar to the TMR system reliability analysis. Additional information on redundancy schemes is available in Nerber [26].