Superpipelined Superscalar Processors
A superscalar processor can be improved further by applying superpipelining to its functional stages. When a machine issues m instructions every cycle (superscalar of degree m) and its pipeline cycle is 1/n of the base cycle (that is, n pipeline stages fit in one base cycle), it is called a superpipelined superscalar machine of degree (m, n).
Figure 8.23 shows such a machine of degree (m, n) = (2, 2). The maximum level of parallelism (considering machine parallelism and ILP) that can be achieved while all the stages are full is mn relative to a scalar base machine.
Superpipelined superscalar execution with degree m = n = 2.
Superpipelined Superscalar Performance
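With an ordinary scalar pipeline machine having k stages, the minimum time required to execute N independent instructions (assuming each stage completes in one clock cycle, taken as the base cycle) is
$$T(1,1) = k + N - 1 \quad \text{(base cycles)}.$$
With a superpipelined superscalar machine of degree (m, n), the first m instructions need k base cycles and the remaining N - m instructions complete at a rate of mn per base cycle, so the minimum time required to execute the same N independent instructions is
$$T(m,n) = k + \frac{N - m}{mn} \quad \text{(base cycles)}.$$
Thus, the ideal speed-up gained by a superpipelined superscalar machine over the base machine is
$$S(m,n) = \frac{T(1,1)}{T(m,n)} = \frac{k + N - 1}{k + (N - m)/(mn)} = \frac{mn\,(k + N - 1)}{mnk + N - m}.$$
It follows that $S(m,n) \to mn$ as $N \to \infty$.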
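The limit can be checked numerically. The following is a minimal sketch, assuming the usual textbook expressions T(1,1) = k + N - 1 for the base machine and T(m,n) = k + (N - m)/(mn) for the machine of degree (m, n), both in base cycles; the function names and the choice k = 4 are illustrative assumptions only.

```python
def scalar_time(k: int, n_instr: int) -> float:
    """Minimum time (base cycles) for n_instr independent instructions
    on a k-stage scalar pipeline: k cycles to fill, then one result per cycle."""
    return k + n_instr - 1


def sps_time(k: int, n_instr: int, m: int, n: int) -> float:
    """Assumed minimum time (base cycles) on a superpipelined superscalar
    machine of degree (m, n): the first m instructions take k base cycles,
    the remaining ones complete at a rate of m*n per base cycle."""
    return k + (n_instr - m) / (m * n)


def speedup(k: int, n_instr: int, m: int, n: int) -> float:
    """Ideal speed-up S(m, n) over the scalar base machine."""
    return scalar_time(k, n_instr) / sps_time(k, n_instr, m, n)


if __name__ == "__main__":
    # Degree (2, 2) as in the figure; k = 4 is an arbitrary illustrative depth.
    for n_instr in (10, 100, 10_000, 1_000_000):
        print(n_instr, round(speedup(4, n_instr, 2, 2), 4))
```

For short instruction streams the pipeline fill time k dominates and the speed-up stays well below mn; only for large N does it approach mn = 4.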
Implementation: Superpipelined Superscalar Processors
Although the superpipelined superscalar architecture was first implemented, in various ways, mainly in RISC machines, its principles were later applied equally to RISC-like CISC systems, notably the Intel Pentium machines. Many commercially popular RISC machines have implemented this principle successfully, but we consider here only the DEC Alpha 21X64 family as a representative of the RISC camp. Similarly, we take the Intel Pentium series, especially the Pentium 4, as a representative of the (RISC-like) CISC camp.
DEC Alpha 21X64
The Alpha 21X64 family, a RISC-based superpipelined superscalar processor design with 64-bit addresses and 64-bit data, was introduced by DEC in 1992 as the successor to its well-known VAX family. The first member of this family was the 21064. It was later revised, and DEC launched the 21164 processor in 1994. This processor was upgraded in turn, and the 21264 was introduced in 1998. The design of this processor family closely resembles that of the HP 8500 in terms of the types of components used and the design methodology followed. The design emphasises speed, multiple pipelined operation units to support multiple instruction issue (superscalar), multiprocessor applications, and software migration from, and compatibility with, the then-current VAX/VMS and MIPS Ultrix systems.
A brief description of this processor with a figure is given on the website: http://routledge.com/9780367255732.
Intel Pentium 4
The Intel Pentium 4, which arrived in June 2000, is a superpipelined superscalar processor built from 42 million transistors and fabricated in a 0.18 µm CMOS process, with clock speeds ranging from 1.4 to 1.7 GHz. At 1.5 GHz, the processor delivers 535 SPECint2000 and 558 SPECfp2000. Its NetBurst architecture differs from that of all the earlier members of the Pentium family (P6 architecture), including the Pentium 3. The outer shell (front-end) of this processor resembles a CISC machine, while the inner shell strictly follows the RISC philosophy. The front-end fetches instructions from memory strictly in program order, as a CISC processor would. The remaining steps required to execute each instruction are similar to those of a RISC processor. Each fetched instruction (executable code of the source program) is translated (decoded) into one or more fixed-length RISC-like instructions known as micro-operations. These micro-operations are then executed by the superpipelined superscalar core, where they may be reordered (executed out of order) as the pipeline structure demands, to keep the pipeline flowing smoothly and improve performance. As in a RISC processor, the result of each micro-operation is committed to the processor's register set in the order of the original program flow.
The Pentium 4 architecture may be viewed as consisting of four basic modules: (i) front-end module, (ii) out-of-order execution engine, (iii) execution module, and (iv) memory subsystem module.
The front-end module of Pentium 4 contains: (i) IA-32 instruction decoder, (ii) trace cache, (iii) microcode ROM, and (iv) front-end branch predictor.
The Pentium 4, using a six-way superscalar design, can dispatch six micro-operations in one cycle; these pass through a pipeline built with hyper-pipelined technology, whose depth extends to 20 stages (superpipeline). Machine instructions (executable code of the source program) are brought from main memory/L3 cache into an on-chip 256 KB non-blocking, eight-way set-associative L2 cache located at the front-end of the processor. The front-end BTB and I-TLB (instruction translation lookaside buffer) assist the unit during fetching. The decoder then translates each machine instruction into one to four micro-operations, each of which is a 118-bit RISC instruction. The micro-operations generated are stored in the L1 instruction cache, also called the execution trace cache, which holds up to 12K decoded micro-operations in the order of program execution. In addition, L1 contains an 8 KB data cache. The actual pipeline begins from this L1 cache onwards. The front-end (outer shell) of the Pentium 4 uses static branch prediction, with the front-end BTB, to determine which instructions are to be fetched next.
The inner shell of the Pentium 4, starting from the L1 cache onwards, contains the pipeline proper and employs a dynamic branch prediction strategy based on the history of recently encountered branch instructions, stored in a four-way set-associative BTB cache having 512 lines. Each entry uses the address of the branch instruction as a tag and also holds the destination address the branch took the previous time and a 4-bit history field. The algorithm used is referred to as Yeh's algorithm, which provides a significant reduction in misprediction compared with algorithms that use 2 bits to maintain history. The Pentium 4 includes out-of-order execution logic that reorders the micro-operations delivered from the L1 trace cache so that each can execute as soon as its input operands are available. The Pentium 4 performs register renaming in the renaming stage of the pipeline, which remaps references to the 16 architectural registers (8 floating-point registers plus EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP) onto a set of 128 physical registers. This stage eliminates the false dependencies caused by the limited number of architectural registers while preserving the true data dependencies (read after write, RAW). The Pentium 4 also employs schedulers that retrieve micro-operations from the micro-operation queues and dispatch them for execution. Whenever concurrent micro-operations compete for a specific execution unit, the schedulers follow first-in-first-out (FIFO) ordering, thereby often favouring in-order execution. The execution units retrieve operand values from the L1 data cache as well as from the integer and floating-point register files, which serve as the source of operands for pending operations. Two ALUs on this processor are clocked at twice the core frequency, which allows basic integer instructions such as Add, Subtract, and logical AND/OR to execute in half a clock cycle.
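The effect of register renaming described above can be illustrated with a small sketch. It is not a model of the actual Pentium 4 renaming hardware: the function name, the tuple format for operations, and the never-recycled free list are assumptions made for illustration. Each write receives a fresh physical register, so write-after-write and write-after-read conflicts on an architectural register disappear, while each read is mapped to the most recent producer, so RAW dependencies are preserved.

```python
def rename(ops, num_physical=128):
    """Rename architectural registers to physical ones.

    ops: list of (dest, (src1, src2, ...)) tuples in program order,
         using architectural register names (hypothetical format).
    Returns the same operations expressed with physical register names.
    """
    next_free = 0
    mapping = {}  # architectural register -> current physical register

    def allocate():
        nonlocal next_free
        if next_free >= num_physical:
            # A real processor recycles registers at retirement; this sketch does not.
            raise RuntimeError("out of physical registers")
        name = f"P{next_free}"
        next_free += 1
        return name

    renamed = []
    for dest, srcs in ops:
        # Sources are looked up first, so they refer to the latest earlier writer.
        new_srcs = []
        for src in srcs:
            if src not in mapping:
                mapping[src] = allocate()  # value live on entry to the sequence
            new_srcs.append(mapping[src])
        # Every write gets a fresh physical register, removing WAW and WAR conflicts.
        mapping[dest] = allocate()
        renamed.append((mapping[dest], tuple(new_srcs)))
    return renamed


if __name__ == "__main__":
    program = [
        ("EAX", ("EBX", "ECX")),  # 1: writes EAX
        ("EDX", ("EAX", "EBX")),  # 2: reads EAX (RAW on 1)
        ("EAX", ("ESI", "EDI")),  # 3: rewrites EAX (WAW with 1, WAR with 2)
        ("EBX", ("EAX", "EDX")),  # 4: reads the EAX written by 3
    ]
    for before, after in zip(program, rename(program)):
        print(before, "->", after)
```

After renaming, operation 3 no longer conflicts with operations 1 and 2 and could be executed ahead of them, whereas operation 2 still has to wait for operation 1.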
A brief description of this processor with its figure is given on the website: http://routledge.com/9780367255732.
VLIW and EPIC Architectures
Traditional machine instructions mostly specify one operation each, e.g. load, store, add, or multiply. By contrast, an instruction may specify multiple operations, which necessarily requires more bits to encode. Processors with this type of instruction are therefore said to have a very long instruction word (VLIW), and a processor architecture built on this principle is known as a VLIW processor. As such a processor uses more functional units than a typical superscalar processor, its cycles per instruction (CPI) can be even lower. The architecture can be thought of as a hybrid of horizontal microcoding (see Chapter 6) and superscalar processing. Horizontal microcoding is reflected in the long instruction words, typically hundreds of bits (256-1024 bits per instruction) in length. Superscalar processing is realised by providing multiple functional units that operate in parallel within the processor. Different fields of the long instruction word carry different opcodes (see horizontal microprogramming, Chapter 6) that are dispatched to different functional units simultaneously and executed in parallel on independent data operands. Programs written in conventional short instruction words (e.g. 32 bits) must be grouped and compacted into VLIW instructions. This code compaction must be done by a compiler, which takes all possible measures, including branch prediction, to optimise the generated object code so that the long-word instructions execute smoothly (Colwell, R. P., et al.).
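The compaction step can be sketched as a simple greedy scheduler. This is a minimal illustration, not the trace-scheduling technique used by real VLIW compilers: the slot mix in `SLOTS`, the single-cycle latency of every operation, the register names, and the rule that a write may share a word with an earlier read of the same register are all assumptions made for the example.

```python
# Hypothetical VLIW word format: two ALU slots, one memory slot, one multiplier slot.
SLOTS = {"alu": 2, "mem": 1, "mul": 1}


def compact(ops):
    """Greedily pack short operations into long instruction words.

    ops: list of (name, unit, dest, srcs) in program order.
    Every operation is assumed to take one cycle, so a consumer must sit
    in a later word than its producer. Returns a list of words, each a
    dict mapping unit type to the operation names placed in it.
    """
    words = []
    last_write = {}  # register -> word index of its most recent writer
    last_read = {}   # register -> latest word index that reads it

    for name, unit, dest, srcs in ops:
        earliest = 0
        for src in srcs:                      # RAW: follow the producer
            if src in last_write:
                earliest = max(earliest, last_write[src] + 1)
        if dest in last_write:                # WAW: follow the previous writer
            earliest = max(earliest, last_write[dest] + 1)
        if dest in last_read:                 # WAR: not before an earlier reader
            earliest = max(earliest, last_read[dest])

        w = earliest
        while w < len(words) and len(words[w][unit]) >= SLOTS[unit]:
            w += 1                            # slot for this unit is taken
        while len(words) <= w:
            words.append({u: [] for u in SLOTS})

        words[w][unit].append(name)
        last_write[dest] = w
        for src in srcs:
            last_read[src] = max(last_read.get(src, 0), w)
    return words


def nop_count(words):
    """Number of slots left empty, i.e. encoded as no-ops."""
    return sum(SLOTS[u] - len(word[u]) for word in words for u in SLOTS)


if __name__ == "__main__":
    program = [
        ("load r1", "mem", "r1", ["r0"]),
        ("load r2", "mem", "r2", ["r0"]),
        ("add r3", "alu", "r3", ["r1", "r2"]),
        ("mul r4", "mul", "r4", ["r1", "r1"]),
        ("sub r5", "alu", "r5", ["r3", "r4"]),
        ("add r6", "alu", "r6", ["r0", "r0"]),
    ]
    schedule = compact(program)
    for i, word in enumerate(schedule):
        print(i, word)
    print("empty slots:", nop_count(schedule))
```

Slots that the scheduler cannot fill remain in the instruction word as no-ops, which is why the quality of the compaction directly determines both performance and code size.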
Further refinement of this concept ultimately leads to what is called explicitly parallel instruction computer (EPIC) (Kathail, B.). The EPIC instruction format, however, is more flexible than the fixed format of multi-operation VLIW instructions; for example, it may allow the compiler to explicitly encode dependences between operations.
The ultimate objective behind both VLIW and EPIC processor architectures is to give the compiler the primary responsibility for exploiting, in parallel, the plentiful hardware resources available in the processor. Theoretically, this would not only reduce the complexity of the processor hardware, but also increase overall processor throughput. Thus, this approach could be considered, at least in theory, a third alternative to the existing RISC and CISC styles of processor architecture.
It is not unfair to say that the VLIW and EPIC concepts, in general, have not delivered on their original promise. The Intel Itanium 64-bit processors (McNairy, C.) make up the best-known processor family of this class. Experience with these processors has revealed, as was also argued elsewhere, that processor hardware does not really become simpler, even when the compiler bears primary responsibility for the detection and exploitation of ILP. Events such as interrupts and cache misses remain unpredictable, and therefore, execution of operations at runtime cannot strictly follow the static schedule specified by the compiler in the VLIW/EPIC instructions; some dynamic scheduling is still required.
These processors have been mostly implemented with microprogrammed control. The clock rate is thus slow due to use of ROM. The execution becomes even slower since some instructions may require a large number of microcode-access cycles to complete their operation. Although VLIW machines are found to behave in a manner very similar to that of superscalar machines, they essentially have some notable differences:
- The code density of a superscalar machine is better than that of a VLIW machine when the available ILP is less than what the VLIW machine can exploit. This is because the VLIW instruction, to keep its format fixed, often includes bits for non-executable operations (no-ops), whereas a superscalar processor issues only executable instructions.
- The ILP in a VLIW machine is extracted entirely at compile time, and the performance of a VLIW processor depends heavily on the efficiency of the code compaction. Superscalar machines, by contrast, detect and exploit ILP in hardware at run time.
- The CPI of a VLIW processor can be lower than that of a superscalar processor.
- VLIW instructions are easier to decode than superscalar instructions.
- The object code of superscalar machines is mostly compatible with a large family of non-parallel machines. VLIW machines that exploit different amounts of parallelism, on the other hand, usually require different instruction sets.
However, in summary, it can be concluded that a VLIW processor still seems to be an extreme of a superscalar processor in which all independent or unrelated operations are already synchronously compacted in advance before a run.
The distinct advantage of VLIW and EPIC architectures lies in the relative simplicity of their hardware structure and underlying instruction set. Parallelism is explicitly encoded in the long instruction word, which eliminates the need for additional hardware and software to detect parallelism at run time.
One of the major shortcomings of the VLIW architecture is its lack of compatibility with conventional hardware and software, which consequently prevents it from delivering good performance in practice. As the working of a VLIW/EPIC architecture depends greatly on compiler-generated ILP, the practical difficulty of such an approach is that the source program may often have to be recompiled even for a different processor model of the same processor family. The reason is simple: such a compiler depends not only on the instruction set architecture (ISA) of the processor family, but also on the hardware resources provided in the specific processor model for which it generates code.
But for highly computation-intensive applications (operations involving matrices) which usually intend to run on specified hardware platforms, this strategy may then well be feasible and even work nicely, and consequently, it may yield significant performance benefits. Such special-purpose applications can even be fine-tuned for a given hardware platform and then could be run profitably on a regular basis for long periods on the same dedicated platforms.
Commonly used programs, such as word processors, spreadsheets, and web browsers, must run without any such recompilation on all processors of a specific family. Most users of software do not have the source programs to recompile, and all the processors of a specific family are expected to be compatible with one another in terms of instruction sets. Therefore, the role of compiler-generated ILP is limited in the case of widely used general-purpose application programs of the types mentioned.
Furthermore, the operations compacted into one instruction may see different latencies in different functional units, even though they are issued at the same time. As a result, different implementations of the same VLIW architecture may be binary-incompatible with one another. That is why the VLIW architecture has never entered the mainstream of computers. Although the idea seems sound in theory, its extreme dependence on code compaction and trace-scheduling compilation for improved performance has prevented it from being widely accepted, and it has fallen from favour, especially in the arena of commercial applications.
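The binary-incompatibility problem can be made concrete with a small check of a static schedule against two sets of functional-unit latencies. This is an illustrative sketch only: the two "models", their latencies, the schedule format, and the function name are hypothetical, and operations listed for the same cycle are assumed not to depend on one another.

```python
def hazards(schedule, latency):
    """Return (operation, register) pairs whose operand is not yet ready.

    schedule: list of (issue_cycle, name, unit, dest, srcs), as fixed by
              the compiler in the long instruction words.
    latency:  dict mapping unit type to its latency in cycles on one
              particular implementation.
    """
    ready = {}     # register -> cycle at which its value becomes available
    problems = []
    for cycle, name, unit, dest, srcs in sorted(schedule, key=lambda op: op[0]):
        for src in srcs:
            if ready.get(src, 0) > cycle:
                problems.append((name, src))  # operand read too early
        ready[dest] = cycle + latency[unit]
    return problems


# A schedule compiled for a machine whose multiplier takes two cycles.
SCHEDULE = [
    (0, "mul r1", "mul", "r1", ["r2", "r3"]),
    (0, "add r4", "alu", "r4", ["r5", "r6"]),
    (2, "add r7", "alu", "r7", ["r1", "r4"]),
]

MODEL_A = {"alu": 1, "mul": 2}  # hypothetical model the code was compiled for
MODEL_B = {"alu": 1, "mul": 3}  # hypothetical sibling with a slower multiplier

if __name__ == "__main__":
    print("model A:", hazards(SCHEDULE, MODEL_A))
    print("model B:", hazards(SCHEDULE, MODEL_B))
```

The same object code is correct on the first model but reads r1 one cycle too early on the second, so it would have to be recompiled, or the hardware would need interlocks that the VLIW approach was meant to avoid.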
Instruction Bundles: the Intel IA-64 Family
In line with the VLIW/EPIC concept, one notable feature among the many distinctive aspects of IA-64 is that three 41-bit instructions are grouped into a 128-bit bundle, along with a 5-bit field called the template, which carries compiler-derived information about how the instructions can be executed in parallel. For example, one of the template codes indicates the location of a stop, which marks the end of a group of instructions that can be executed in parallel. Such a group may extend over a number of bundles. The processor uses the information in the templates to schedule the parallel execution of such grouped instructions on multiple functional units and so achieve superscalar operation. In this respect IA-64 closely resembles EPIC, which is itself considered an extension of the concept of VLIW instruction set design (Colwell 1988).
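The arithmetic of the bundle format (3 × 41 + 5 = 128 bits) can be illustrated by packing and unpacking a bundle as a single integer. This is a minimal sketch: the field order assumed here (template in the five low-order bits, followed by slots 0 to 2) and the function names are assumptions for illustration, and the values used are arbitrary placeholders, not real IA-64 encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 5 + 3 * 41 = 128


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instructions into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, instr in enumerate(slots):
        if not 0 <= instr < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= instr << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


if __name__ == "__main__":
    # Placeholder values only; they do not correspond to real instructions.
    bundle = pack_bundle(0x10, [0x123, 0x456, 0x789])
    print(BUNDLE_BITS, hex(bundle))
    print(unpack_bundle(bundle))
```

The three instruction slots and the template fill the 128 bits exactly, so information about parallelism and stops has to be expressed through the template code itself.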