Control Hazard

In any computer program using any programming language, there is normally a need for some kind of statement that allows the flow of control to be something other than sequential. Instructions that do this are called branches. Whenever a branch instruction is executed, the action after the branch instruction is called a branch taken that fetches a nonsequential or remote instruction, called a branch target. The address of the branch target is then to be loaded into the program counter (PC), which will invalidate all the instructions that are either in the pipeline or prefetched in the buffer. This invalidation leads to draining, and subsequent refilling of the pipeline afresh from the branch target onwards is carried out whenever a branch is taken. As a result, the performance of the pipeline is adversely affected since the address of the next instruction to be executed is not exactly known until the instruction that causes branching (transfer of control out of sequence) is executed. The net effect is that the throughput of the pipeline is drastically reduced, when compared to that of a sequential processor. The clock cycles lost due to draining of partly executed instructions lying at different stages in the pipeline and subsequent refilling of the pipeline to start afresh is often referred to as the branch penalty. It is to be noted that the presence of a branch instruction does not cause the pipeline to automatically drain and begin refilling. However, a branch, if it is not taken, allows the normal uninterrupted sequential flow of instructions as usual to continue in the pipeline to be executed.

Only when a branch is taken does the problem arise. This summarily indicates that the presence of the branch instruction, if not handled properly, is one of the critical problems in designing and operating an instruction pipeline (Dubey, P., et al.).

Branch instruction, in general, can be classified into three groups:

i. Unconditional branch (simple branch, e.g., go to, subroutine call, interrupt servicing, return, etc.)

ii. Conditional branch (e.g., if-then-else).

iii. Loop branch (e.g., for loop).

An unconditional branch always alters the sequential program flow, and accordingly sets a new target address in the program counter, and the execution flow then starts from the target address called a target path onwards, rather than incrementing the program counter by 1 to point to the next sequential instruction as is normally the case. A conditional branch sets a new target address in the program counter only when a certain condition, usually based on a condition code, is satisfied. It then selects a path of instructions that starts from the target address called a target path. If the condition is not satisfied, the path starts from the next sequential instruction of the branch instruction as usual, and is called a sequential path. A loop branch in a loop statement usually jumps back to the beginning of the loop and executes the loop either a fixed or variable number of times depending on certain situation (data-dependent).

8.5.2.3.1 Unconditional Branches and Solution Approaches

To demonstrate the operation of a pipeline while executing an unconditional branch instruction, let us consider Figure 8.11 which shows a program segment consisting of a sequence of instructions Ia—13 stored at successive memory locations being executed in a five-stage pipeline. Assume that the instruction I2 is an unconditional branch instruction and the corresponding branch target instruction is IK. In clock cycle 4, the decoding operation for I3 is in progress while the branch instruction I2computes its target address. In clock

FIGURE 8.11

Idle cycles caused by branch instruction [address computed at the CO stage].

cycle 5, the processor must discard I3 and I4 which have been partly processed and clear the pipeline for all the instructions that appeared after the branch instructions and already in different pipeline stages. The processor then fetches the branch target instruction IK. The Execute stage is instructed to do nothing during that clock period. The ultimate effect is that the entire pipeline then comes to a stall. This clearing of pipeline stages by draining out the partly executed instructions from all the stages appearing after the branch instruction and also idle staying of units (as shown X and blank in the Figure 8.11) causes a significant degradation in pipeline performance which is often referred to as branch penalty. For a deeper pipeline (pipeline having a higher number of stages), the branch penalty paid may be even higher.

One feasible approach aimed at reducing the branch penalty is to compute the branch address earlier in the pipeline as shown in Figure 8.12 where target address has been computed in the decoding stage (ID), rather than at a later stage (CO stage) as carried out above. The instruction I4 has not even entered the pipeline, the number of pipeline stages to be cleared is decreased, and the branch penalty to be paid will be definitely reduced. Typically, the fetch unit has dedicated hardware to identify a branch instruction and compute the branch target address as quickly as possible after an instruction is fetched. This approach thus substantially reduces the branch penalty, even for a deeper pipeline.

Another approach being used to reduce the branch penalty is to employ a sophisticated fetch unit that can fetch instructions before they are needed and put them in a queue which can store several instructions. An additional separate unit, called the dispatch unit, takes instructions from the front of the queue, performs decoding functions, and finally issues them to the next stage in the pipeline for execution as shown in Figure 8.13.

To make this scheme effectively operative, the fetch unit must be equipped with sufficient decoding and processing capabilities to identify and execute branch instructions. The fetch unit always tries to maintain the queue full most of the time to ensure an adequate supply of instructions for processing. This approach reduces the impact of occasional delays due to fetching of instructions while the pipeline is facing a branch or a cache miss, as the dispatch unit still continues to issue instructions from the instruction queue. Conversely, when the pipeline stalls, for example, due to a data hazard, the dispatch unit is not in a position to issue instructions from the instruction queue. The fetch unit, however, at that

FIGURE 8.12

Idle cycles caused by branch instruction [address computed at the earlier stage (DI stage)].

FIGURE 8.13

Instruction queue included in the hardware-pipelined organisation.

time still continues to fetch instructions and add them to the instruction queue to keep it full. Since the fetch unit is capable of identifying and executing branch instructions, this operation is carried out by the fetch unit concurrently with the execution of other instructions. This technique is sometimes referred to as branch folding. When the branch folding occurs due to the appearance of a branch instruction, the instruction queue must have at least one instruction other than the branch instruction to avoid a pipeline stall. That is why, in many processors, the bus width between the fetch unit and the instruction cache is such that it allows reading more than one instruction in each clock cycle.

The instruction queue has a similar contribution while dealing with cache misses. When a cache miss occurs, the fetch unit is then engaged to read the desired cache block from the main memory or from a secondary cache leaving its primary task of instruction fetching. As a result, the pipeline may be stalled. But in such a situation, the dispatch unit removes this stall by sending instructions continually for execution as long as the instruction queue is not empty. When the fetch operation is once again resumed, the instruction queue is refilled. Thus, in the event of a cache miss, while it has the support of a nonempty instruction queue, has almost no impact on the rate of instruction execution.

8.5.2.3.2 Conditional Branches and Solution Approaches

A conditional branch when being taken, breaks off the execution of normal sequential flow of instruction stream, requiring a new target address to be set in the program counter. As a result, it causes the pipeline to stall, since a new path of instructions is selected other than the usual flow of the next sequential instruction that starts from the target address. We will now examine the effect of branch instructions on the overall performance of the pipeline.

8.5.2.3.2.1 Effect of Branching Among all the preceding branch types already mentioned, conditional branches are possibly the hardest to handle. Appearance of a conditional branch instruction can invalidate several instructions already partly processed. A similar unpredictable event is the occurrence of an interrupt. However, for example, consider the execution of a sequence of instructions including a conditional branch instruction as given below.

I.

  • 12 (conditional branch to IK)
  • 13
  • 14

I5

IK (branch target)

Ijc+i

Figure 8.14 illustrates the effects of the conditional branch during the execution of this sequence in our pipeline when the target path is selected. The instruction I2 is a conditional branch to instruction IK. Until the instruction I2 is executed (EX), there is no way of knowing which instruction will come next. The pipeline, in this example, simply loads the next instruction (I3) in sequence as usual and proceeds. The branch is taken only at the end of time unit 5, when a branch is ultimately taken to IK. All the instructions following the branch in the pipeline would then be useless and have to be flushed, thereby wasting a number of useful cycles already consumed. At time unit 6, the instruction IK ultimately enters the pipeline. No instructions are completed during time units 7-9; this is the performance penalty (branch penalty) incurred because the branch could not be anticipated. The number of pipeline cycles wasted (ill effects) between a branch taken and refilling of its branch target is called the delay slot; let it be denoted by d. In general, 0 < d < к- 1, where к is the number of pipeline stages.

The impact of such ill effects due to branching in this type of pipeline is computed and is given in the website: http://routledge.com/9780367255732.

8.5.2.3.2.2 Different Techniques The above analysis implies that the pipeline performance is adversely affected due to the presence of branch instruction, and the degradation seems to be appreciable when the instruction stream is sufficiently long. To reduce the negative effects of conditional branching on processor performance, several techniques have been proposed. Some of the better known techniques are as follows:

FIGURE 8.14

Branch penalty for a conditional branch.

i. Prefetch branch target

ii. Multiple streams

iii. Loop buffer

iv. Branch prediction

v. Delayed branching.

Here the instruction pipelining is implemented using certain mechanisms. This includes a brute-force approach to replicate the initial portions of the pipeline by introducing instruction buffers at the instruction fetching stage. While the second stage in the pipeline is executing the instruction, the fetching stage takes advantage of any unused memory cycles to fetch and buffer the instruction. This is called instruction prefetch or fetch overlap. In one memory-access time, a block of consecutive instructions are then fetched into a prefetched buffer. This block access can be achieved using a cache memory or even using interleaved memory modules to shorten the effective memory-access time, and hence speedup the pipeline operation. RISC processor MIPS 4000 uses this approach.

  • 8.5.2.3.2.2.1 Prefetch Branch Target When a conditional branch is recognised at the fetch stage, the target instruction of the branch is prefetched, in addition to the instruction following the branch instruction as usual. This target is then saved until the branch instruction is executed. If the branch is now actually taken after checking the given condition, the target has already been prefetched and is available. The large IBM system 360/91 model has used this particular approach.
  • 8.5.2.3.2.2.2 Multiple Streams In this type of design, the processor fetches both possible paths, and hence, two sets of buffer pairs, namely, sequential buffer and target buffer, are employed to service the pipeline, making use of two different streams. In normal execution, a pair of sequential buffers is used to load sequential instructions from the next sequential address of the branch instruction for in-sequence pipelining. At the time of detecting a conditional branch instruction, the sequential buffers are filled with next block of sequential instructions after the branch instruction as usual, and the target buffers are filled with the instructions from the branch target address of the branch instructions for out-of-sequence pipelining. This is illustrated in Figure 8.15. Once the branch condition is executed and the branch decision is made, appropriate instructions are taken from one of the two respective buffer pairs, and the instructions in the other buffer pairs, hence, will be simply discarded. In each pair, one buffer is used to load instructions from memory,

FIGURE 8.15

The use of sequential and target buffers.

and the other buffer in the same pair is used to feed instructions in the pipeline. In this way, the two buffers in a set alternate to prevent a collision between instructions flowing into and flowing out of the pipeline.

This strategy of double buffering to have two or more pipeline streams as described ensures a constant flow of instructions and data to the pipeline and reduces the time delays caused by the draining and refilling of the pipeline and thereby improves the performance. Some amount of performance degradation, however, is still unavoidable whenever the pipeline is drained at any point of time. This approach has been successfully implemented in IBM 370/168 and IBM 3033 large machines.

  • 8.5.2.3.2.2.3 Loop Buffer In case of sequential instructions contained in a small loop, a loop buffer is used as a prefetched buffer. A loop buffer basically is a very-high-speed small memory attached with instruction fetch stage of the pipeline and containing n most recently fetched instructions in sequence. Instructions in the loop body are prefetched as usual and will be executed in sequence if there is no branching out of the loop, until all iterations complete execution. If a branch is detected which falls within the loop boundary, and if the target instruction is already in the loop buffer, unnecessary memory access can be avoided; otherwise standard procedures are followed to handle branch conditions. In essence, the loop buffer works in a similar manner in principle to an instruction cache dedicated for instruction execution. The differences are that the loop buffer only retains instructions in sequence, while the cache contains instructions in the most recentlyused order. The loop buffer is much smaller in size, and hence the cost is lower, but the speed is higher. Both CDC 6600/CDC 7600/star-100 and CRAY1 supercomputer have exploited the loop buffer technique. Motorola 68010 implemented a loop buffer in a specified form for executing a three-instruction loop using the DBcc (decrement and branch on condition code) instruction. A three-word buffer is maintained, and the processor executes these instructions repeatedly until the loop condition is satisfied.
  • 8.5.2.3.2.2.4 Branch Prediction Another increasingly popular and more powerful technique is to predict the outcome of a branch decision before the branch is actually executed. Therefore, based on a particular prediction, either the sequential path or the target path is chosen beforehand for execution. Although the chosen path often reduces the branch penalty, it may also increase the penalty in case of incorrect prediction. Various techniques can be used to predict whether a branch will be taken. There are two types of predictions: static and dynamic.

In static prediction, a fixed decision for prefetching one of the two paths is made during compilation and before the program runs. The static prediction direction (taken or not taken) is usually wired into the processor. The wired-in static prediction cannot be changed once designed into the hardware. Static types, however, consists of three approaches:

  • 1. Predict-always-taken
  • 2. Predict-never-taken
  • 3. Predict-by-opcode.

The first one is a simple technique that would be to always assume that the branch will be taken. This technique always loads the program counter with the target address when a branch instruction is encountered. The second one assumes that the branch will not be taken at all and continues to fetch instructions in sequence as usual. The third approach

makes the decision based on opcode of the branch instruction. It is assumed that the branch will be taken for some branch types based on opcodes, and automatically choose the target path and sequential path for the rest of branch types. It has been observed that the success rate with this strategy is greater than 75%. However, if the chosen path is wrong, the pipeline is drained, and the instructions corresponding to the correct path are fetched; the penalty is paid. Each of these three approaches takes its own individual specific decision before the arrival of any conditional branch instruction. Such a decision can even be modified allowing the programmer or compiler to select the direction of each branch on a semi-static prediction basis. The complexity of the compiler in that situation will be naturally increased. Static prediction methods usually require little extra hardware. They do not consider the runtime history generated from previously executed branch instructions during the program execution upto the time that the next conditional branch instruction appears. The Motorola 68020 and the VAX 11/780 use the predict-never-taken approach.

A good example in the use of a static prediction mechanism possibly lies in the implementation of Intel Pentium 4 processor. The architecture of Pentium 4 consists of two cells. The outer cell of this processor is a CISC-like cell that fetches the instruction from the memory strictly in the order of the program submitted by the user, thereby resembling a CISC-like approach. The inner shell of this processor, however, consisting of a 20-stage pipeline strictly follows the RISC philosophy (for more details, see superscalar architecture) using RISC-like micro-operations (instructions) that are obtained by translating (decoding) each instruction of a user-submitted program by a fetch/decode unit. The outer shell of Pentium 4 exploits static branch prediction using the front-end BTB (branch target buffer) that contains the instructions of a user-submitted program fetched already from memory by fetch/decode unit, to supply the needed instructions, based on the outcome of the execution of branch instruction.

Dynamic (branch strategies) prediction attempts to improve the accuracy of prediction during the execution of the program by making a decision based on the profile information (history) collected from the previously executed branches. For example, a simple technique would be to record the history of the paths taken by each of the last two branch instructions. If the executions of last two branch instructions have chosen the same path, that path will then be chosen for the execution of the current branch instruction. But if the two paths do not match, one of the paths could be then chosen at random for the current one. The amount of history to be recorded at runtime to keep track of the past behaviour of the branch instructions should be limited; otherwise this will, in reality, be infeasible to implement. Therefore, most dynamic predictions are determined with limited recent history accumulated in dedicated additional hardware.

One such approach known as counter-based branch prediction mechanism to implement a prediction scheme is to associate an n-bit counter with each branch instruction. In this method, after executing a branch instruction for the first time, its counter C is set to be a threshold T if the target path is taken or to T - 1 if the sequential path is taken. From then on, whenever, the branch instruction is about to be executed, this history is consulted for prediction. If С > T, then the target path is taken; otherwise the sequential path is followed. The counter value C is updated after the branch is resolved. If the correct path is the target path, the counter is incremented by 1; if not, C is decremented by 1. If C ever reaches the upper bound, i.e. 2" -1, C will no longer be incremented, even if the target path is correctly predicted and chosen. Likewise, C is never decremented if it reaches the lower bound, i.e. 0. Observations and studies over real-time environments dictate that the performance of a 2-bit predictor (n = 2) and the value of T = 2 is almost the same as that of predictors

FIGURE 8.16

State transition diagram of a 2-bit predictor.

with a higher number of bits (higher value of n). Figure 8.16 represents a state transition diagram of the possible states in a 2-bit predictor.

Now, the obvious question that arises is, strategically at which stage in the pipeline will the branch prediction action be taken during runtime? Dynamic branch strategies can be categorised into three major classes:

  • • One class predicts the branch direction based upon information available at the decode (DI) stage.
  • • Another class uses a cache to store target address at the stage the effective address of the branch target is computed (CO).
  • • The third scheme uses a cache to store target instructions at the fetch stage (FI).

In fact, whatever strategy is being followed, it is to be noted that all dynamic predictions are accordingly adjusted dynamically as the execution of the program gradually proceeds.

Dynamic prediction is based on speculation, and hence, branch prediction with speculative execution requires extensive ILP — supported by temporary data registers, BTB, multiple e-units, and so forth that together consequently increase the hardware complexity. But this technique, if implemented adequately and reasonably, would then subsequently require relatively less work at the time of compilation. Unfortunately, the then existing hardware technology was not capable enough to realise this technique properly. This is one of the reasons that branch prediction techniques were not widely used until the 1990s, when ongoing hardware technology further improved and subsequently matured down the years sufficiently to support it. In general, dynamic prediction usually offers better performance than its counterpart static prediction and also provides a greater degree of object code compatibility, since decisions are made after the time of compilation (runtime). This technique has, however, been commercially implemented with utmost success by both Intel in Pentium Pro (as also in the original Pentium) and also in Pentium 4 and by Motorola in its 68060 processor introduced sometime in mid-1990s. The Pentium processor uses a simple form of branch prediction. It chooses the same direction that it chose the last time when the branch was executed. This requires a table of last branch address values to be maintained for each branch instruction. For a program loop, the predicted direction is thus correct for all branches after the first branch, until the loop is exited.

A brief description of the implementation of branch prediction in Intel Pentium 4 is given in the website: http://routledge.com/9780367255732.

8.5.2.3.2.2.5 Delayed Branching Observations on the branch penalty being paid reveal that the branch penalty would be significantly reduced if the delay slot could be minimised to attain an almost zero penalty. The purpose of the delayed branching scheme is to eliminate or at least to significantly reduce the adverse effect of the branch penalty. In this type of design, a certain number of instructions after the branch instruction are fetched and executed regardless of which path will be chosen for the branch. These instructions should be independent of the outcome of the branch instruction. A delayed branch of к cycles thus allows at most к - 1 such useful instructions' execution, irrespective of the fate of the execution of the branch instruction; otherwise, a zero branch penalty cannot be achieved. A delayed branch for two cycles (k = 2) is illustrated in Figure 8.17 which shows that one {k -1 = 2-1 = 1) delay instruction is required for this scheme.

For example, a processor with a branch delay of к executes a path containing the next к-1 sequential instructions and then either continues on the same path or starts a new path from a new target address. As often as possible, the compiler tries to fill the next к-1 instruction slots after the branch instructions with the instructions from the rest of the program that are independent from the said branch instruction. If no such instruction is available, NOP instructions are placed in any remaining slots. Consider the following example:

ij LD R1; I

  • 12 LD R2, J
  • 13 BrZr R2, i7 branch to i7 if R2=0
  • 14 LD R3, К
  • 15 ADD R4, R2, R3 -* R4 = R2 + R3
  • 16 MUL R5/ R-i / R4 -* R5 = Ri X R4
  • 17 ADD R4, Rx, R2 -* R4 = Rx + R2

Assuming a delayed branch for three cycles (k = 3) when the branch condition is resolved at the operand computation (CO) stage, the compiler modifies this code by moving the instruction i[ and inserting an NOP instruction after the branch instruction i3. The modified code is as follows:

  • 12 LD R2, J
  • 13 BrZr R2, i7

FIGURE 8.17

A delayed branch for two cycles when the branch condition is resolved at the DI stage.

ij LD R1; I NOP

  • 14 LD Ri; К
  • 15 ADD R4< R2, R3
  • 16 MUL R5, R1( R4 i, ADD R4, R4, R2

It can be seen in the modified code that the instruction i, is executed regardless of the outcome of the branch instruction. The instruction i4 could be taken in place of NOP. The technique as used here is similar to that used for software interlocking (discussed earlier). NOP behaves here as a filler. The corresponding pipeline stages for this modified code are illustrated in Figure 8.18.

The delayed branch strategy was, however, explored with the advent of RISC machines and is often employed in these processors now. Here, the processor computes the result of conditional branch instructions before any unusable instructions have been prefetched. With this approach, the processor always executes the single instruction that immediately follows the branch. This keeps the pipeline full to maximise the utilisation of the instruction pipeline while the processor fetches a new instruction stream. The delayed branch strategy was, however, subsequently perceived to be less appealing for many reasons as technological developments progressed, when more powerful and productive superscalar machines (described later in this chapter) were introduced. This strategy was then found to be unconducive and perhaps less appropriate to the basic design philosophy based on which the relatively advanced superscalar machine was built.

The branch prediction technique, on the other hand, is however, considered to fit mainly in the domain of pre-RISC culture. It regained importance when the more advanced superscalar machine, under certain compulsions due to its design philosophy, started to exploit this technique for its own convenience. Some processors, like PowerPC 601, use a simple static branch prediction technique. More sophisticated superscalar processors such as the PowerPC 620, the Pentium 4, and most other superscalar machines use the traditional dynamic branch prediction technique based on accumulated branch history analysis to improve their efficiencies.

FIGURE 8.18

Code modification for a delayed branch of three cycles when the branch condition is resolved at the CO stage.

A solved real-life example showing the usefulness of branch prediction method is given in the website: http://routledge.com/9780367255732.

 
Source
< Prev   CONTENTS   Source   Next >