# Precision Considerations: Guard Bits

Several practical issues arise when floating-point operations are implemented. Although the significands (mantissas) of the operands and of the final result are limited to a fixed number of bits (e.g. 24 bits for single format, including the implicit leading 1), extra bits are often needed to hold the results of the intermediate steps of an arithmetic operation. The ALU registers that hold the exponent and significand of each operand before and after a floating-point operation are typically longer than the significand plus its implied bit. The registers therefore provide additional low-order bits for the operands and for intermediate results. These extra retained bits are called guard bits, and they help to achieve the maximum accuracy in the final result.
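The benefit of guard bits can be seen in a small subtraction. The sketch below is only an illustration: the function name `subtract_aligned` and the 4-bit significand are assumptions made for the example and do not come from any particular ALU design. It subtracts 1.111 (binary) x 2^0 from 1.000 (binary) x 2^1, first with no guard bit and then with one.

```python
def subtract_aligned(sig_a, exp_a, sig_b, exp_b, guard_bits):
    """Return a - b, where a = sig_a * 2**exp_a and b = sig_b * 2**exp_b.

    Significands are integers and exp_a >= exp_b is assumed. b is shifted
    right to align it with a; only `guard_bits` extra low-order bits survive.
    """
    shift = exp_a - exp_b
    a = sig_a << guard_bits
    b = (sig_b << guard_bits) >> shift  # bits shifted past the guard positions are lost
    return (a - b) * 2.0 ** (exp_a - guard_bits)


# 1.000 (binary) x 2^1 = 2.0 is the integer 0b1000 scaled by 2^-2.
# 1.111 (binary) x 2^0 = 1.875 is the integer 0b1111 scaled by 2^-3.
no_guard = subtract_aligned(0b1000, -2, 0b1111, -3, guard_bits=0)
one_guard = subtract_aligned(0b1000, -2, 0b1111, -3, guard_bits=1)
print(no_guard)   # 0.25  (the exact answer is 0.125)
print(one_guard)  # 0.125
```

Without a guard bit, the lowest bit of the smaller operand is lost during alignment and the result is wrong by a factor of two; a single guard bit is enough to recover the exact difference in this case.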

## Truncation

When a result is generated, the guard bits usually have to be removed from the extended significand so that it fits the specified length, which yields an approximation of the longer value. Simply chopping off the extra bits, with no change to the retained bits, is known as truncation.

Truncation and its impact on the final result are discussed on the website: http://routledge.com/9780367255732.

## Rounding: IEEE Standard

Rounding is a variant of truncation that also disposes of the guard bits (extra bits) when a value is represented in a specific format. Common types include Von Neumann rounding, regular rounding, and the rounding specified in the IEEE 754 floating-point standard as the default mode for truncation. The IEEE standard refers to its truncation methods as rounding modes and lists four alternatives: round to nearest, round toward +∞, round toward −∞, and round toward zero.
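The four rounding modes can be compared on small decimal values. The sketch below uses Python's `decimal` module as a stand-in; it rounds decimal digits rather than binary significands, so it is an analogy and not the IEEE 754 binary procedure itself. The helper `round_all_modes` and the mode labels are names chosen for this example.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

# Mapping from the four mode names to the corresponding decimal constants.
MODES = {
    "round to nearest (ties to even)": ROUND_HALF_EVEN,
    "round toward +infinity": ROUND_CEILING,
    "round toward -infinity": ROUND_FLOOR,
    "round toward zero": ROUND_DOWN,
}


def round_all_modes(text):
    """Round the decimal string `text` to an integer under each mode."""
    value = Decimal(text)
    return {
        name: int(value.quantize(Decimal("1"), rounding=mode))
        for name, mode in MODES.items()
    }


for text in ("2.5", "-2.5", "2.7"):
    print(text, round_all_modes(text))
```

The modes differ only in how they treat the discarded part: 2.5 rounds to 2, 3, 2, and 2 under the four modes in the order listed, while -2.5 rounds to -2, -2, -3, and -2.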

When guard bits are absent, or are removed by truncation (or rounding) at every intermediate step of a computation, the error that accumulates in the final result may be appreciably high. The IEEE floating-point standard therefore specifies how guard bits and truncation (rounding) are to be used so that the maximum error stays within half a unit in the LSB position of the final result. In general, this requires a rounding scheme in which only three guard bits are carried through each intermediate step. The first two are the two most significant bits of the section of the significand to be removed. The third is the logical OR of all the bits beyond these first two in the full representation of the significand. This bit is relatively easy to maintain during the intermediate steps: it is initially set to 0, and if a 1 is shifted out through this position, the bit becomes 1 and retains that value. For this reason, it is sometimes called the sticky bit.
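The three-bit scheme can be sketched in a few lines. The code below is a minimal illustration for round to nearest with ties going to the even value; the function name `round_nearest_even` and the bit names `guard`, `round_bit`, and `sticky` are labels assumed for this example. It treats the significand as a plain integer and does not renormalize if rounding carries out of the top bit.

```python
def round_nearest_even(sig, extra_bits):
    """Drop `extra_bits` low-order bits of the integer `sig`, rounding to nearest even."""
    if extra_bits == 0:
        return sig
    kept = sig >> extra_bits
    removed = sig & ((1 << extra_bits) - 1)

    # First two bits of the removed section.
    guard = (removed >> (extra_bits - 1)) & 1
    round_bit = (removed >> (extra_bits - 2)) & 1 if extra_bits >= 2 else 0
    # Sticky bit: logical OR of every bit beyond the first two.
    sticky = 1 if extra_bits >= 3 and removed & ((1 << (extra_bits - 2)) - 1) else 0

    more_than_half = guard and (round_bit or sticky)
    exactly_half = guard and not (round_bit or sticky)
    if more_than_half or (exactly_half and kept & 1):
        kept += 1  # round up; a tie rounds up only when the kept LSB is 1
    return kept


print(bin(round_nearest_even(0b1011100, 3)))    # 0b1100 (tie, kept value odd: round up)
print(bin(round_nearest_even(0b1010100, 3)))    # 0b1010 (tie, kept value even: keep)
print(bin(round_nearest_even(0b101010001, 5)))  # 0b1011 (sticky bit breaks the tie)
```

The last call shows why the sticky bit matters: the first two removed bits alone (1 and 0) look like an exact tie, and only the sticky bit reveals that the discarded part is more than half a unit.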

Details of the different types of rounding mentioned above, including the truncation methods specified in the IEEE standard, are given on the website: http://routledge.com/9780367255732.

## Infinity, NaNs, and Denormalized Numbers: IEEE Standards

IEEE 754 defines not only the rounding modes already described but also many other aspects of floating-point arithmetic, including the procedures to be followed, so that a uniform and predictable result is obtained regardless of the hardware platform. The focus here is on three such aspects introduced by the IEEE standard: infinity, NaNs, and denormalized numbers.

Infinity: Real arithmetic treats infinity as a limiting case, and the infinity values obey the following ordering:

$$-\infty < (\text{any finite number}) < +\infty$$

Any arithmetic operation involving infinity, except for some special cases discussed later, yields the obvious result.

If $x$ is any finite number, then

$$x + (+\infty) = +\infty, \qquad x - (+\infty) = -\infty, \qquad x + (-\infty) = -\infty, \qquad x - (-\infty) = +\infty$$

The special cases involving $\infty$, such as $(+\infty) - (+\infty)$, yield NaNs.

NaN: A NaN is a special value encoded in floating-point format that is generated as the result of an invalid operation. NaNs are of two types: (i) signalling NaNs and (ii) quiet NaNs.

A signalling NaN signals an exception whenever an operation is attempted with it as an operand. Signalling NaNs are used for values that lie outside the domain covered by the IEEE standard.

A quiet NaN, on the other hand, propagates through almost every arithmetic operation without signalling an exception. It is produced by invalid operations such as $(-\infty) - (-\infty)$, $0 \times \infty$, and $0 \div 0$. Because a quiet NaN propagates silently, it can be hazardous: the invalid result may go unnoticed until it causes a failure that may sometimes be fatal.

Although the IEEE 754 Standard provides the same general format for both types of NaNs, the way the two kinds are represented is implementation-specific, so that the system can distinguish them and handle the various exception conditions appropriately.
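The behaviour of infinity and quiet NaNs can be observed directly in a language whose floats follow IEEE 754 double format. The sketch below uses Python floats; the variable names are chosen for this example. Python itself raises `ZeroDivisionError` for `0.0 / 0.0` instead of returning a NaN, so that case is left out.

```python
import math

inf = float("inf")
x = 5.0  # any finite number

# Operations with the "obvious" results.
finite_plus_inf = x + inf    # inf
finite_minus_inf = x - inf   # -inf
finite_over_inf = x / inf    # 0.0

# Invalid operations produce a quiet NaN instead of raising an error.
nan_from_sub = inf - inf
nan_from_mul = 0.0 * inf

# A quiet NaN propagates silently through later arithmetic ...
propagated = nan_from_sub * 2.0 + 1.0

# ... and compares unequal to everything, including itself.
print(math.isnan(propagated))        # True
print(nan_from_sub == nan_from_sub)  # False
```

Because no exception is raised along the way, the only reliable check is an explicit test such as `math.isnan` at the point where the result is used.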

Denormalized numbers: Under the IEEE 754 Standard, the result of a floating-point arithmetic operation is normally normalized. However, if only normalized numbers are used, there is a noticeable gap between the smallest normalized number and 0 (Figure 7.17). In the single format (32-bit) under IEEE 754, there are $2^{23}$ representable numbers in each interval between consecutive powers of 2, and the smallest representable positive normalized number is $2^{-126}$. If denormalized numbers are included in this format, an additional $2^{23} - 1$ numbers are added, uniformly spaced, between 0 and $2^{-126}$.

A denormalized representation has an exponent field of 0, and there is no assumed leading 1 before the binary point of the fractional part $f$. For denormalized numbers, the exponent is taken to be $1 - \text{bias}$ instead of $0 - \text{bias}$, where $\text{bias} = 2^{k-1} - 1$ and $k$ is the number of bits in the exponent field. Therefore, the value of a denormalized positive number in single format is $0.f \times 2^{-126}$. For example,

The largest denormalized 32-bit (single-precision) number is:

$$0.11\ldots1_2 \times 2^{-126} = (1 - 2^{-23}) \times 2^{-126}$$

The smallest positive denormalized 32-bit (single-precision) number is:

$$0.00\ldots01_2 \times 2^{-126} = 2^{-23} \times 2^{-126} = 2^{-149}$$

Denormalized numbers are included in the standard mainly to handle exponent underflow. When the exponent of a result becomes too small (a large negative exponent), the result is denormalized by right-shifting the fraction (significand) and incrementing the exponent for each shift until the exponent comes within the representable range.

Without denormalized numbers, the gap between zero and the smallest representable normalized number is much wider than the gap between that number and the next larger one. Denormalized numbers fill this gap with uniformly spaced values, so that results lose precision gradually as they approach 0 instead of dropping abruptly to 0; this is why their use is sometimes referred to as gradual underflow. In effect, it reduces the width between the smallest representable nonzero number and zero, and it reduces the effect of exponent underflow to a level comparable to that of rounding among normalized numbers.
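The boundary between denormalized and normalized numbers in single format can be checked by building the bit patterns directly. The sketch below uses Python's standard `struct` module to reinterpret 32-bit integers as single-format values; the helper name `single_from_bits` is an assumption of this example.

```python
import struct


def single_from_bits(bits):
    """Interpret a 32-bit integer as an IEEE 754 single-format value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


smallest_denormal = single_from_bits(0x00000001)  # exponent field 0, fraction 00...01
largest_denormal = single_from_bits(0x007FFFFF)   # exponent field 0, fraction 11...11
smallest_normal = single_from_bits(0x00800000)    # exponent field 1, fraction 00...00

print(smallest_denormal == 2.0 ** -149)                          # True
print(largest_denormal == (1 - 2.0 ** -23) * 2.0 ** -126)        # True
print(smallest_normal - largest_denormal == smallest_denormal)   # True
```

The last line shows the uniform spacing: the step from the largest denormalized number to the smallest normalized number is the same $2^{-149}$ that separates neighbouring denormalized numbers.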

# Summary of Floating-Point Numbers

The significance of the usual bit patterns in the IEEE 754 Standard formats and their interpretation, including the unusual bit patterns that represent special values, have already been described. The extreme exponent values of all zeros (0) and all ones (255 in single format and 2047 in double format) define the special values, as already explained.

$\text{Value} = (-1)^S \times M \times 2^E$, where $S$ = sign, $M$ = significand (mantissa), and $E$ = exponent

$\text{Bias} = 2^{k-1} - 1$, where $k$ = number of bits in the exponent field

For 32-bit (single-precision), bias = $2^{8-1} - 1 = 127$.

For 64-bit (double-precision), bias = $2^{11-1} - 1 = 1023$.

Normalized: Significand field has an implied (hidden) leading 1. Exponent field contains at least one 1 but is not all 1s. $E$ = (unsigned value of exponent field) − bias.

Denormalized: Significand field has an implied leading 0. All exponent field bits are equal to 0. $E = 1 - \text{bias}$.

Special cases: NaN, infinity, and denormalized (already described).
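The rules in this summary can be turned into a small decoder for the single format. The sketch below is one possible reading of them; the function name `decode_single` and the category labels it returns are choices made for this example. It separates zero from the other denormalized patterns only for readability.

```python
import math


def decode_single(bits):
    """Classify a 32-bit pattern and compute its value as (-1)^S x M x 2^E."""
    sign = bits >> 31
    exp_field = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    bias = 2 ** (8 - 1) - 1  # 127 for an 8-bit exponent field

    if exp_field == 0xFF:  # all ones: special values
        if fraction == 0:
            return "infinity", -math.inf if sign else math.inf
        return "NaN", math.nan

    if exp_field == 0:  # all zeros: implied leading 0, E = 1 - bias
        significand = fraction / 2 ** 23
        exponent = 1 - bias
        kind = "zero" if fraction == 0 else "denormalized"
    else:  # implied leading 1, E = field value - bias
        significand = 1 + fraction / 2 ** 23
        exponent = exp_field - bias
        kind = "normalized"

    return kind, (-1) ** sign * significand * 2.0 ** exponent


print(decode_single(0xC0E00000))  # ('normalized', -7.0)
print(decode_single(0x3F800000))  # ('normalized', 1.0)
print(decode_single(0x7F800000))  # ('infinity', inf)
```

For example, `0xC0E00000` has sign 1, exponent field 129, and fraction bits 11 followed by zeros, so its value is $-1.11_2 \times 2^{129-127} = -7$.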

# Summary

Arithmetic and logical (non-numerical) operations on various types of operands, including fixed-point and floating-point numbers, are carried out by the data processing part of a CPU, a major constituent of which is the ALU. Most modern processors include many types of instructions in their instruction sets to enable the ALU to carry out these operations, and in many cases they also include the hardware required to process floating-point instructions. Computer arithmetic circuits now use several well-developed logic designs, including high-performance adders and sophisticated multiplication and division units based on Booth's algorithm and the restoring/non-restoring division algorithms, respectively. Floating-point and other complex operations are implemented by an autonomous execution unit within the CPU or by a supporting co-processor, which is a program-transparent extension to the CPU. A floating-point processor is typically composed of a pair of fixed-point ALUs: one to process exponents and the other to process mantissas. Special circuits are also needed for normalization and, in floating-point addition and subtraction, for exponent comparison and mantissa alignment. The floating-point number representation standard proposed by IEEE has been described, along with a set of rules under this specification for performing the four basic arithmetic operations, including the handling of special values and exceptions.

# Exercises

• 7.1 What range of decimal values can be represented by a four-digit hex number? Convert $3259_{10}$ to hexadecimal, and then from hexadecimal to binary.
• 7.2 How many bits are required to count up to decimal 1 million? Convert $874_{10}$ to octal, and then from octal to binary.
• 7.3 A small process-control computer uses hexadecimal codes to represent its 16-bit memory addresses.

a. How many hex digits are required?

b. What is the range of addresses in hex?

c. How many memory locations are there?

• 7.4 Convert the decimal number $927.45_{10}$ to an equivalent binary number.
• 7.5 What is an advantage of encoding a decimal number in BCD rather than in straight binary? What is a disadvantage?
• 7.6 Represent the decimal value 195 by its straight binary equivalent. Then encode the same decimal number using BCD code. Convert the binary number 0101 to its equivalent Gray code.
• 7.7 Explain with reasons the minimum and maximum values of integers that can be represented by n-bit data using

a. signed-magnitude method

b. signed 1's complement method

c. signed 2's complement method

• 7.8 Represent the number -21 in 8-bit format using

a. signed-magnitude method

b. signed 1's complement method

c. signed 2's complement method

• 7.9 Explain the principle of a CLA with an appropriate diagram considering binary numbers of 4 bits. State the merits of this adder as well as its drawbacks.
• 7.10 Show how to extend the 16-bit design of Figure 7.18 to a 64-bit adder using the same two component types: a 4-bit adder module and a 4-bit carry-lookahead generator.
• 7.11 Write down the Boolean expression for overflow condition when adding or subtracting two binary numbers expressed in two's complement. [For answer, see Section 7.3, Alternative approach.]
• 7.12 Show that the logic expression $c_n \oplus c_{n-1}$ is a correct indicator of overflow in the addition of 2's complement integers.
• 7.13 "CSA is called a fast adder". Explain with an example why it is so called. Describe the principle and its operation with a schematic diagram.
• 7.14 Multiply each of the following pairs of signed 2's complement numbers using Booth's algorithm. Assume that X is the multiplicand and Y is the multiplier.

i. X = 110101 and Y = 011011

ii. X = 010111 and Y = 110110

iii. X = +14 and Y = -13

• 7.15 Use Booth's algorithm to multiply -28 (multiplicand) by -12 (multiplier), where each number is represented using 6 bits.
• 7.16 Describe the situation of worst case and the best case when Booth's algorithm is used.
• 7.17 With restoring method used in two's complement integer division algorithm, the value in the A register must be restored following unsuccessful subtraction. A slightly more complex approach, known as non-restoring, avoids the unnecessary subtraction and addition. Derive an algorithm for the latter approach.
• 7.18 Divide -121 by 13 in binary two's complement notation, using 8-bit words. Use both restoring and non-restoring division approaches.
• 7.19 How can the non-restoring division algorithm be derived from the restoring division algorithm?
• 7.20 What is meant by floating-point representation of a number? Why is it so called? How is a binary number N represented in floating-point notation?
• 7.21 What are the four essential elements of a number in the floating-point representation?
• 7.22 A floating-point number system uses 16 bits for representing a number. The most significant bit is the sign bit. The least significant nine bits represent the significand (mantissa), and the remaining six bits represent the exponent. Assume that the numbers are stored in the normalized format with, as usual, one hidden bit.

a. Give the representation of $-2762.5 \times 10^{-2}$.

b. Compute the value represented by 1 001010 011000000.

• 7.23 What is the benefit obtained by using biased representation for the exponent portion of a floating-point number?
• 7.24 What would be the bias value for

a. A base-8 exponent (B = 8) in a 5-bit field?

b. A base-16 exponent (B = 16) in a 6-bit field?

• 7.25 A 32-bit number can represent a maximum of $2^{32}$ different numbers. How many different numbers can be represented in the IEEE 754 single-precision 32-bit format? Explain.
• 7.26 Represent the following decimal numbers in IEEE 754 single-precision format:

a. -7

b. -1.75

c. 389

d. 245.625

e. 1/16

f. -1/32

• 7.27 The following numbers use the IEEE 32-bit floating-point format. What are their equivalent decimal values?

a. 0101 0101 0110 0000 0000 0000 0000 0000

b. 1100 0011 0110 0000 0000 0000 0000 0000

c. 0011 1111 1010 1100 1000 0000 0000 0000

• 7.28 Compute the content of the mantissa field when -26.75 is to be stored, with the value interpreted as $(-1)^S \times 2^{E-31} \times 1.M$, where the mantissa M is of 8 bits, the exponent is of 6 bits, and 1 bit is used for the sign. [Hint: $(26.75)_{10} = 11010.11_2 = 1.101011_2 \times 2^4$, M = 10101100]
• 7.29 Consider a floating-point format with 8 bits for the biased exponent and 23 bits for the significand (mantissa). Show the bit pattern for the following numbers in this format:

a. -549

b. 0.645

• 7.30 Show how the following floating-point additions are performed in which significands are truncated to 4 decimal digits.

a. $6.487 \times 10^{2} + 5.693 \times 10^{2}$

b. $8.546 \times 10^{2} + 7.425 \times 10^{-2}$

Show the results also in normalized form.

• 7.31 Show how the following floating-point calculations are made in which significands are truncated to 4 decimal digits.

a. $8.748 \times 10^{-3} - 6.593 \times 10^{-3}$

b. $7.756 \times 10^{-3} - 2.259 \times 10^{-1}$

Show the results also in normalized form.

• 7.32 Show how the following floating-point computations are performed in which significands are truncated to 4 decimal digits.

a. $(6.432 \times 10^{2}) \times (2.154 \times 10^{0})$

b. $(7.756 \times 10^{3}) \times (2.259 \times 10^{2})$

Show the results also in normalized form.

• 7.33 State and explain the significance and implications of guard bits.
• 7.34 Which of the following truncation techniques has the smallest unbiased rounding error and why?

i. Chopping

ii. Rounding

iii. Von Neumann rounding

iv. Both (ii) and (iii)

• 7.35 If X = 1.586, find the relative error if X is truncated to 1.58, and if it is rounded to 1.59.
• 7.36 In computer-based calculations, one of the critical errors occurs when two nearly equal numbers are subtracted. Assume X = 0.23186 and Y = 0.23143. The computer truncates all values to four decimal digits, so that X' = 0.2318 and Y' = 0.2314.

a. What are the relative errors for X' and Y'?

b. What is the relative error for Z' = X' - Y'?

• 7.37 Explain how NaN and infinity are represented in the IEEE 754 standard.
