A multicore processor, also known as a chip multiprocessor, combines two or more processors (called cores) on a single processor chip (a single piece of silicon, called a die). Typically, each core consists of all of the components of an independent processor, such as ALU, registers, control unit, and pipeline hardware, plus L1 instruction and data caches. In addition to the multiple cores, contemporary multicore chips also include an on-chip L2 cache and, in some cases, even an on-chip L3 cache. A multicore processor thus provides, in essence, a form of structural parallelism within a single processor chip.
The primary design choices for the multicore organisation of a single processor involve certain critical factors, such as the number of processor cores to be used in the chip, the number of levels of cache memory to be included, and the extent to which the cache memory is to be shared (Olukotun, K., et al.). The fundamental parameters behind these factors form a trade-off that can be expressed in simplified form as

Chip area = number of cores × chip area per core

Transistor count = number of cores × transistor count per core

For the sake of simplicity, it has been assumed that the cache and interconnect area (and transistor count) can be apportioned on a per-core basis.
At a given time, VLSI technology limits the left-hand side of the above equations, while the designers must decide what weight to give to each of the two factors appearing on the right-hand side. Aggressive exploitation of ILP, with multiple functional units and more complex control logic, increases the chip area (and transistor count) per processor core. Consequently, fewer cores can be realised within the processor. Alternatively, in a design approach catering to a different category of target applications, the designers may select simpler cores that consume less chip area and fewer transistors per core, and consequently place a larger number of such simpler cores on a single chip. Practical system design involves issues that are considerably more complex and more critical than these; still, a basic design issue is visible here: for the targeted application and desired performance, how should the designers divide the available chip resources among processor cores and, within a single core, among its various functional components? (Jerraya, A., et al.)
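The trade-off described above can be made concrete with a small calculation. The following is a minimal sketch, assuming the simplified relation chip area = number of cores × chip area per core; the function name `cores_that_fit` and all area figures are hypothetical values chosen only for illustration, not data for any real chip.

```python
def cores_that_fit(chip_area_mm2: float, core_area_mm2: float) -> int:
    """Return how many whole cores fit on the die.

    Assumes the simplified model: chip area = cores x area per core,
    where the per-core area already includes that core's share of
    cache and interconnect.
    """
    if core_area_mm2 <= 0:
        raise ValueError("core area must be positive")
    return int(chip_area_mm2 // core_area_mm2)


# Hypothetical figures, chosen only to illustrate the trade-off.
CHIP_AREA = 200.0          # total die budget fixed by the technology (mm^2)
COMPLEX_CORE_AREA = 50.0   # core that exploits ILP aggressively (mm^2)
SIMPLE_CORE_AREA = 12.5    # simpler core (mm^2)

print(cores_that_fit(CHIP_AREA, COMPLEX_CORE_AREA))  # few complex cores
print(cores_that_fit(CHIP_AREA, SIMPLE_CORE_AREA))   # many simple cores
```

With the same area budget, the design with simpler cores accommodates four times as many cores in this example; whether that is the better choice depends on the target applications.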
The organisation of a multicore system is essentially determined by three main parameters:
- The number of processor cores on the chip
- The number of levels of cache memory to be employed
- The amount of cache memory that is to be shared.
Each parameter is further specified by its number, size, capacity, capability, and placement, and by the way in which it interacts with the others; together these eventually determine the type of organisation that the multicore system will have.
Four general organisations of contemporary multicore systems are depicted in Figure 8.32. Figure 8.32a illustrates an organisation in which the only on-chip cache is the L1 cache, which is divided into instruction and data caches, with each core having its own dedicated L1 cache. This type of organisation is found in some of the earlier multicore computer chips and is still in use in some embedded chips. An example of this organisation is the ARM11 MPCore. The organisation shown in Figure 8.32b indicates that there is enough room available on the chip to build a separate on-chip dedicated unified L2 cache, apart from the existing on-chip L1 split cache (I-cache and D-cache) in each core. This type of multicore organisation is found in the AMD Opteron. Figure 8.32c shows an arrangement almost the same as that in Figure 8.32b, with the difference that the on-chip L2 cache in Figure 8.32c is larger and is shared by all cores. The Intel Core Duo processor and the UltraSPARC T2 have this organisation. As VLSI technology advances and provides the system designer with an abundance of hardware capabilities, the amount of space available on the chip, as well as the total transistor count obtainable on the chip, continues to grow. This encourages the designers to include, for the sake of improving performance further, a separate shared unified L3 cache, apart from the usual dedicated L1 and L2 caches for each core. This organisation is illustrated in Figure 8.32d. The Intel Core i7 is an example of this organisational approach.

Figure 8.32: Different possibilities of multicore organisation: (a) dedicated L1 cache, (b) dedicated L2 cache, (c) shared L2 cache, and (d) shared L3 cache.
Now the burning question arises as to how these on-chip caches will ultimately be deployed. Whether the on-chip cache (especially the L2 cache) should be used in a shared or a dedicated manner to realise a higher net throughput is essentially a typical design trade-off. Both the shared and the dedicated approaches have their own merits and drawbacks, but the use of a shared on-chip L2 cache usually shows several distinct advantages over exclusive dependence on dedicated caches. Some of the notable advantages are as follows:
- A shared cache usually reduces overall miss rates. For example, when a thread (or a process) executing on one core accesses a main memory location, the corresponding frame containing the referenced location is brought into the shared cache. If another thread (or process) on a different core soon thereafter accesses the same memory location, the targeted memory block will already be available in the shared on-chip cache. This not only reduces the total miss rate, but also allows the thread (or process) to avoid repeated and expensive memory visits.
- Data shared by multiple cores need not be replicated at the shared cache level.
- If the frame replacement algorithm is properly devised, the amount of shared cache allocated to each core is essentially dynamic, so that threads (or processes) that have less locality of reference can make use of more cache memory.
- The presence of a shared L2 cache confines the critical cache coherence problem to the L1 cache level, which in turn may offer an additional performance advantage.
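The first advantage listed above can be illustrated with a toy simulation. The sketch below assumes an LRU replacement policy, a total L2 capacity of eight blocks, and a hypothetical access trace in which two cores read the same four blocks one after the other; the names `LRUCache` and `total_misses` are illustrative and do not model any particular processor.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny fully associative cache with LRU replacement."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.misses = 0

    def access(self, block: int) -> bool:
        """Access a block; return True on a hit, False on a miss."""
        if block in self.blocks:
            self.blocks.move_to_end(block)   # mark as most recently used
            return True
        self.misses += 1
        self.blocks[block] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return False


def total_misses(trace, shared: bool, total_blocks: int = 8, cores: int = 2) -> int:
    """Count L2 misses for a trace of (core, block) accesses."""
    if shared:
        one = LRUCache(total_blocks)         # one cache used by every core
        per_core = [one] * cores
        distinct = [one]
    else:
        # Same total capacity, split evenly into private caches.
        distinct = [LRUCache(total_blocks // cores) for _ in range(cores)]
        per_core = distinct
    for core, block in trace:
        per_core[core].access(block)
    return sum(cache.misses for cache in distinct)


# Hypothetical trace: core 0 reads blocks 0..3, then core 1 reads the same blocks.
trace = [(0, b) for b in range(4)] + [(1, b) for b in range(4)]
print("shared L2 misses: ", total_misses(trace, shared=True))
print("private L2 misses:", total_misses(trace, shared=False))
```

In this trace the second core finds every block already present in the shared cache, whereas with private caches each core must fetch its own copy. A trace with no data sharing between cores would show no such benefit.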
On the other hand, the use of a dedicated L2 cache in each core on the chip shows some potential performance advantages. Each core has the benefit of more rapid and direct access to its private L2 cache. This is especially advantageous for threads (or processes) that exhibit strong locality. At the same time, one of the major drawbacks of using a dedicated L2 cache is that the critical cache coherence problem may propagate to the L2 cache level in addition to the already affected L1 cache level.
Due to the continuous improvement in VLSI technology, the main memory size is constantly increasing, and the number of cores that can be fabricated on a chip is also steadily growing. Consequently, the use of a shared L3 cache combined with either a shared or a dedicated L2 cache per core is likely to provide a higher net throughput than simply employing one extremely large shared L2 cache.
Last but not least, another important design decision in multicore system organisation is whether the individual cores will be superscalar or will implement SHMT. In recent years, there has been a clear shift in system design away from structural parallelism within a single instruction stream (superscalar) and towards support for fine-grained structural parallelism across hardware threads. The basic driver behind such a shift is simple: achieving maximum performance for a given system cost. The development of multicore SHMT architecture is a clear consequence of this shift. For example, the Intel Core Duo uses superscalar cores, whereas the more advanced Intel Core i7 uses SHMT cores. One of the main reasons behind this approach is that SHMT scales up the number of hardware-level threads that the multicore system provides. Thus, a multicore system with four cores and SHMT that supports three simultaneous threads in each core appears, at the application level, to be the same as a multicore system with 12 cores. As software development for a given range of applications proceeds to meet the grand challenge of fully exploiting parallel resources on platforms with steadily growing VLSI capabilities, the SHMT approach appears to be more conducive and supportive than its counterpart, the superscalar approach.
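The arithmetic behind the example above is simply the product of the number of cores and the number of hardware threads per core. A minimal sketch follows; the function name `logical_processors` is illustrative only.

```python
def logical_processors(cores: int, threads_per_core: int) -> int:
    """Number of hardware threads visible at the application level."""
    return cores * threads_per_core


# Four cores, each supporting three simultaneous threads.
print(logical_processors(4, 3))  # 12
```

The count describes only how many threads can be scheduled at once; threads that share a core also share its functional units, so they do not deliver the throughput of the same number of independent cores.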
Basic Multicore Implementation: Intel Core Duo
Intel has continuously introduced more modern multicore products, using constantly evolving VLSI technology to achieve ever higher performance while preserving backward compatibility in keeping with its family concept. These processors, with their standard CISC instruction set, combine standard RISC design techniques, such as a micro-operation pipeline, multiple functional units, and out-of-order sequencing, in their internal structure. The original Core brand refers to Intel's 32-bit mobile dual-core x86 CPUs that were derived from the Pentium M branded processors, which used an enhanced version of the Intel P6 microarchitecture. The Core brand comprised two branches: the Duo (dual-core) and the Solo (a Duo with one core disabled, which replaced the Pentium M brand of single-core mobile processors). It emerged in parallel with the NetBurst microarchitecture (Intel P68) of the Pentium 4 brand and was a precursor to the 64-bit Core microarchitecture of Core 2 branded CPUs.
Intel's first dual-core mobile (low-power) processor was launched on January 6, 2006, with the release of the 32-bit Yonah CPU, which used a 65 nm fabrication technology and was targeted mainly at laptop use. Its dual-core layout, contrary to its name, had more in common with two interconnected Pentium M branded CPUs packaged as a single-die silicon chip (IC) than with the subsequent 64-bit Core microarchitecture of Core 2 branded CPUs. Within a short period, Intel launched many other similar processors having almost identical architectures but with improved fabrication technology. Each consisted of two x86 superscalar cores (which is why it is called Core Duo), with no provision for Intel's own hyper-threading technology. The Core Duo processor operated at a speed of 1.66 GHz and, as usual, was equipped with a dedicated L1 cache for each core and a large shared L2 cache. The instruction set architecture was 32 bits wide.
In this section, we look only at a basic and notable multicore architecture, the Intel Core Duo, with a brief description of its key elements, their salient features, and their respective activities, to give an overall understanding of the fundamental multicore aspects. The general structure of the Intel Core Duo with its major functional elements is shown in Figure 8.33.
Each core here is equipped with a separate independent thermal control unit to manage power consumption and the related dissipation of heat generated in this kind of high-density chip, so as to yield maximum processor performance within the confines of existing thermal constraints. This unit also improves ergonomics with a cooler system and lower fan acoustic noise, and monitors digital sensors for high-accuracy die (chip) temperature measurements. The maximum temperature of each core (an independent thermal zone) is reported separately via dedicated registers that can be polled by software. If the temperature in a core at any point of time exceeds a threshold, the thermal control unit at once reduces the clock rate for that core to reduce power consumption and thereby trim down heat generation.

Figure 8.33: Schematic block diagram of Intel Core Duo.
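The per-core throttling policy described above can be sketched as follows. This is only an illustration of the idea: the threshold, the size of the clock step, and the minimum clock are hypothetical values, and the function `throttle` is not an Intel interface.

```python
def throttle(core_temps_c, clocks_mhz,
             threshold_c=100.0, step_mhz=333.0, floor_mhz=1000.0):
    """Return new per-core clock rates after one thermal check.

    Each core is an independent thermal zone: only a core whose
    temperature exceeds the threshold has its clock reduced.
    All numeric defaults are hypothetical.
    """
    new_clocks = []
    for temp, clock in zip(core_temps_c, clocks_mhz):
        if temp > threshold_c:
            clock = max(floor_mhz, clock - step_mhz)  # slow the hot core
        new_clocks.append(clock)
    return new_clocks


# Core 0 is within limits; core 1 has exceeded the threshold.
print(throttle([85.0, 105.0], [1660.0, 1660.0]))
```

Only the second core is slowed in this example, which reflects the point that each core is monitored and controlled separately.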
The Advanced Programmable Interrupt Controller (APIC) performs many important functions including the following:
- The APIC provides interprocessor interrupts, which allow any process to interrupt any other processor or set of processors. When a process (or a thread) in one core generates an interrupt, it is received by the local APIC and routed to the APIC of another core as an interrupt to that core.
- The APIC accepts I/O interrupts and routes these interrupts to the appropriate cores.
- Each APIC is equipped with a timer which can be set by the operating system to generate an interrupt to the respective local core.
The power management logic unit takes care of power consumption: it monitors thermal conditions and CPU activity and adjusts voltage levels, thereby increasing battery life for mobile platforms such as laptops and mobile phones. The VID voltage (Voltage Identification, as used by Intel: the set/stock voltage for a given clock speed) range within which the processor chip operates is 1.0-1.212 V. This unit includes an advanced power-gating capability that allows ultra-fine-grained logic control, turning on individual processor logic subsystems only if they are needed. In addition, many buses and arrays are split so that data required only in some modes of operation can be put in a low-power state when not needed. Even so, the power consumption remains around 15 W.
As usual with all multicore systems, each core here also has its own dedicated L1 split cache consisting of a 32-KB instruction cache and a 32-KB data cache. The processor includes a shared 2-MB L2 cache. The cache logic permits dynamic allocation of shared cache space based on the needs of the currently executing cores, so that one core can be assigned up to 100% of the L2 cache space. The L2 cache includes logic to support the MESI protocol (to resolve cache coherence, see Chapter 4) for the attached L1 caches. When a processor writes to a cache line at the L1 level, the line gets the M (modified) state; if the line is not in the E (exclusive) or M state prior to the write, the cache sends a Read-For-Ownership (RFO) request, which ensures that the line is present in the L1 cache and is in the I (invalid) state in the other L1 cache. The Intel Core Duo, however, extends this protocol to accommodate the situation in which multiple Core Duo chips are organised as a symmetric multiprocessor (SMP) system (discussed in Chapter 10).
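The write behaviour described above can be sketched as a small state-transition function for one cache line held in two L1 caches. This is a simplified model that covers only the write case from the text; the names `State` and `write_line` are illustrative, and the sketch does not capture the Core Duo's extensions to the protocol.

```python
from enum import Enum


class State(Enum):
    M = "modified"
    E = "exclusive"
    S = "shared"
    I = "invalid"


def write_line(local: State, remote: State):
    """Model a write by the local core to one cache line.

    Returns (new_local_state, new_remote_state, rfo_sent).
    """
    if local in (State.M, State.E):
        # The local L1 already owns the line exclusively: no request needed.
        return State.M, remote, False
    # Line is shared or absent locally: a Read-For-Ownership request
    # brings the line into the local L1 and invalidates the other copy.
    return State.M, State.I, True


# Both L1 caches hold the line in the shared state; the local core writes.
print(write_line(State.S, State.S))
# The local L1 holds the line exclusively; the write needs no RFO.
print(write_line(State.E, State.I))
```

In both cases the writing core ends with the line in the M state; the difference is whether an RFO request is needed to invalidate the copy in the other L1 cache.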
The Intel Core Duo is equipped with a bus arbiter; the bus interface controls the L2 cache and provides the connection to the external bus, known as the front-side bus, which operates at a speed of 667 MHz with no provision for parity. This bus in turn connects to the main memory, I/O controllers, and other processor chips.
More details about the Intel Core Duo processor are given on the website: http://routledge.com/9780367255732.
Intel Core 2 Duo
The majority of the desktop and mobile Core 2 processor variants were Core 2 Duo, with two processor cores on a single Merom, Conroe, Allendale, Penryn, or Wolfdale chip. The Allendale was launched in January 2007 using a 65 nm fabrication technology, and the Wolfdale was launched in January 2008 using a 45 nm fabrication technology. These came with a wide range of performance and power consumption, from the relatively slow ultra-low-power Uxxxx (10 W) and low-power Lxxxx (17 W) versions to the more performance-oriented Pxxxx (25 W) and Txxxx (35 W) mobile versions and the Exxxx (65 W) desktop models. The mobile Core 2 Duo processors with an "S" prefix in the name are produced in a smaller µFC-BGA 956 package, which allows more compact laptops to be built. Each version comes as a number of products (chips), and each product carries a higher product number than its predecessor. A higher number within a given version usually indicates better performance, which depends largely on the core, the clock frequency of the front-side bus, and the size of the second-level cache, all of which are model-specific. Core 2 Duo processors typically use the full L2 cache of 2, 3, 4, or 6 MB available in the specific stepping of the chip, while versions with a reduced cache size are sold for the low-end consumer market as Celeron or Pentium Dual-Core processors. Like these processors, some low-end Core 2 Duo models also disable features such as Intel Virtualization Technology.
Intel Core 2 Quad
Core 2 Quad processors are multichip modules consisting of two dies similar to those used in Core 2 Duo, forming a quad-core processor. Under ideal conditions, this allows nearly twice the performance of a dual-core processor at the same clock frequency. Initially, all Core 2 Quad models were versions of Core 2 Duo desktop processors: Kentsfield, derived from Conroe, was launched in January 2007 with a 65 nm fabrication technology, and Yorkfield, derived from Wolfdale, was launched in March 2008 using a 45 nm fabrication technology. Later, Penryn-QC was added as a high-end version of the mobile dual-core Penryn. The Xeon 32xx and 33xx processors are mostly identical versions of the desktop Core 2 Quad processors and can be used interchangeably.
A brief description of the Intel Core 2 Quad with a relevant table is given on the website: http://routledge.com/9780367255732.