While accessing a cache, the CPU can address it either by a physical address or by a virtual address, irrespective of whether it is a unified cache or a split cache. This leads to two different models of cache design.
Physical Address Cache
When a physical memory address is used to access a cache, the cache is called a physical address cache. This model applies to unified caches as well as to split caches.
When a physical address cache is used as a unified cache, as implemented in the Intel 486, in the DEC VAX 8600, and in many other similar contemporary systems, the cache is indexed and tagged with a physical address. When the processor issues an address, this virtual address is, as usual, first translated by the TLB or the MMU before any cache lookup. Each address is always uniquely translated, without ambiguity, to one cache item. Figure 4.29 shows a scheme of such a system.
The physical address cache is also used in split cache designs, in which the data cache and instruction cache are separate units, both accessed with a physical address after translation by an on-chip MMU. A multi-level data cache (here only two levels are considered) can also be employed with a fast processor, with the aim of minimizing the usual cache misses. A split cache design was implemented in the Motorola 68030 and, later, in a more refined form in the Motorola 68040, which also incorporated two independent address translation caches (TLBs) that permit the simultaneous translation of instruction and data addresses. The Alpha AXP 21064 microprocessor uses an 8-KB instruction cache and an identical 8-KB data cache. The MIPS R3000 (RISC architecture) CPU also uses this approach. Figure 4.30 illustrates the scheme of a split cache design.
As usual, the first-level D-cache is the smaller one, of about 64 KB (or even more today), and uses the write-through policy, while the second-level D-cache is slower but larger, about 256 KB/512 KB, and uses the write-back policy. Here, the I-cache is usually a single-level one of a smaller size, about 64 KB. Most manufacturers prefer the first-level D-cache to be on-chip, and both second-level caches (D-cache and I-cache) to be off-chip.
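The difference between the two write policies mentioned above can be made concrete with a small sketch. This is a toy model with invented names, not any particular controller's design:

```python
# Toy model contrasting write-through and write-back policies.
# All class/function names here are invented for illustration.
class CacheLine:
    def __init__(self):
        self.value = None
        self.dirty = False

def write(line, value, memory, addr, write_through):
    line.value = value
    if write_through:
        memory[addr] = value   # write-through: memory updated on every write
    else:
        line.dirty = True      # write-back: memory update deferred

def evict(line, memory, addr):
    if line.dirty:             # write-back pays the memory write only on eviction
        memory[addr] = line.value
        line.dirty = False

mem = {0x10: "old"}
wb_line = CacheLine()
write(wb_line, "new", mem, 0x10, write_through=False)
print(mem[0x10])   # still "old": memory not yet updated
evict(wb_line, mem, 0x10)
print(mem[0x10])   # "new": written back on eviction
```

The sketch shows why write-through suits a small, fast first-level cache (memory always consistent, simple control) while write-back suits the larger second-level cache (fewer memory writes, at the cost of dirty-line bookkeeping).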
Figure 4.29 Unified cache design when accessed by physical address.
Figure 4.30 Split cache design when accessed by physical address.
The advantage of physical address caches lies in their design simplicity, which requires little intervention from the operating system. Since the cache is accessed with physical addresses, no ambiguity arises in accessing it, as the cache holds the same index/tag for a given address. Cache flushing is not required if proper bus watching is provided on the system bus while servicing requests from the CPU or DMA I/O. One drawback of this design is the slowdown of cache access while waiting for the completion of the address translation performed by the MMU/TLB.
Virtual Address Cache
One of the major shortcomings of the physical address cache is the slowdown in accessing the cache until the MMU/TLB completes address translation. This drawback is alleviated in the virtual address cache, which is indexed and tagged with the virtual address, hence its name. Figure 4.31 shows an example of this scheme for a unified cache. The virtual address goes to the cache and the MMU simultaneously; translation of the virtual address to a physical address is done by the MMU in parallel with the cache lookup. Hence, the cache lookup is not delayed at all; rather, cache access efficiency is increased because it overlaps with the MMU's address translation. The MMU translation yields the physical main memory address, which is saved for later use by the cache for write-back, if required. Since hits are much more common than misses, virtual addressing effectively eliminates the address translation time in the case of a cache hit.
The serious drawback associated with a virtual address cache is the aliasing (or synonyms) problem, in which differently addressed data appear to have the same index/tag in the cache. This is severe particularly when multiple applications (processes) execute in a system, and two or more processes attempt to access the same physical cache location. There are various ways to solve this problem, and each system using a virtual address cache has solved it from its own viewpoint. For example, UNIX handles it at the kernel level by periodically flushing the cache, or by completely flushing the cache on each context switch. SUN systems solve it by attaching a process identifier (PID) tag or a physical address tag to each line of the cache, identifying the process to which the address refers. Whatever approach is considered, the ultimate objective is to enhance cache performance while avoiding complexity as much as possible.
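A toy sketch can illustrate the aliasing hazard and the PID-tag remedy described above. The data structures here are invented for illustration and do not model any vendor's actual hardware:

```python
# Toy model of a virtually indexed cache, with and without a PID tag.
class VirtualCache:
    def __init__(self, tag_with_pid=False):
        self.tag_with_pid = tag_with_pid
        self.lines = {}                       # line key -> cached value

    def key(self, pid, vaddr):
        # With PID tagging, two processes using the same virtual
        # address occupy distinct cache entries.
        return (pid, vaddr) if self.tag_with_pid else vaddr

    def write(self, pid, vaddr, value):
        self.lines[self.key(pid, vaddr)] = value

    def read(self, pid, vaddr):
        return self.lines.get(self.key(pid, vaddr))

plain = VirtualCache()
plain.write(pid=1, vaddr=0x1000, value="process 1 data")
# Process 2 uses the same virtual address for different physical data:
print(plain.read(pid=2, vaddr=0x1000))  # "process 1 data" <- a false hit (alias)

tagged = VirtualCache(tag_with_pid=True)
tagged.write(pid=1, vaddr=0x1000, value="process 1 data")
print(tagged.read(pid=2, vaddr=0x1000))  # None -> a miss, as it should be
```

Without the PID tag, process 2 silently reads process 1's data; with it, the access correctly misses and is refilled from memory.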
Figure 4.31 Virtual address access in a unified cache.
Figure 4.32 Virtual address access in a split cache.
Figure 4.32 illustrates a scheme of a virtual address cache used in a split cache organisation. Exploiting the virtual address cache design, the Intel i860, a RISC processor, uses split caches for data and instructions. The data cache (D-cache) is 8 Kbytes, the instruction cache (I-cache) is 4 Kbytes, and both have a 32-byte block size. A 2-way set-associative organisation is implemented, with 128 sets in the D-cache and 64 sets in the I-cache. The instruction length is 32 bits. Virtual addresses, generated in the integer unit (IU), are 32 bits wide; the physical address is also 32 bits long.
Miss Rate and Miss Penalty
In practice, the designer's target is to achieve as high a hit ratio as possible at every level of the memory hierarchy. A miss in the cache is called a block miss, since the unit of transfer is the block; in the case of main memory it is called a page fault, because the page is the unit of transfer. Every time a miss occurs, the next higher level of memory is consulted. This causes a substantial amount of additional time to be spent accessing the next, relatively slower, level of memory. The CPU stalls for a long duration in such a situation, until the response arrives. A heavy penalty thus has to be paid every time a miss occurs. The miss rate is therefore one of the important parameters in cache design, though not the only one. The miss rate is expressed as the number of accesses that miss divided by the total number of cache accesses. To compute a rough estimate of the penalty paid due to cache misses, let us first assume that the total CPU time required for a program to execute is:

CPU time = IC × CPI × Clock cycle time
where, IC = Instruction Count = Total number of instructions executed, and CPI = Cycles Per Instruction.
If, for example, the clock cycle time (clock period) is 2 ns, then the clock rate is 1/(2 ns) = 500 MHz.
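The baseline formula can be sketched numerically; the instruction count and CPI below are hypothetical values chosen only to make the arithmetic concrete:

```python
# CPU time = IC x CPI x Clock cycle time (illustrative numbers).
def cpu_time_ns(ic, cpi, clock_cycle_ns):
    return ic * cpi * clock_cycle_ns

clock_cycle_ns = 2.0                       # 2 ns clock period
clock_rate_mhz = 1000.0 / clock_cycle_ns   # 1/(2 ns) = 500 MHz
print(clock_rate_mhz)                      # 500.0

# e.g. one million instructions at an average of 1.5 cycles each:
print(cpu_time_ns(1_000_000, 1.5, clock_cycle_ns))  # 3000000.0 ns = 3 ms
```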
In fact, CPU time must include not only the clock cycles the CPU itself requires to execute the instructions, but also the number of cycles during which the CPU is stalled, waiting for the memory to deliver the needed operand; these are referred to as memory stall cycles. Hence,

CPU time = (CPU clock cycles + Memory stall cycles) × Clock cycle time
In the case of a cache hit, the CPU clock cycle count is hardly affected; if it is affected at all, the impact is negligible. But when a cache miss occurs, the CPU is stalled, and memory stall cycles increase due to the needed visit to the slower main memory. The number of such memory stall cycles depends both on the number of misses and on the number of extra cycles incurred by every miss (the extra overhead), commonly known as the miss penalty. Thus,

Memory stall cycles = Number of misses × Miss penalty
and the miss penalty can now be calculated with the following formula:

Miss penalty = Memory access time / Clock cycle time
For example, if the memory access time of a computer system is 200 ns, and the clock rate is 500 MHz [i.e. clock cycle = 1/clock rate = 1/(500 × 10^6) s = 2 ns], then:

Miss penalty = 200 ns / 2 ns = 100 cycles
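Putting the formulas together in one numeric sketch; the miss count and base cycle count below are hypothetical:

```python
# Miss penalty = Memory access time / Clock cycle time
# Memory stall cycles = Number of misses x Miss penalty
# CPU time = (CPU clock cycles + Memory stall cycles) x Clock cycle time
def miss_penalty(mem_access_ns, clock_cycle_ns):
    return mem_access_ns / clock_cycle_ns

penalty = miss_penalty(200, 2)            # 100.0 cycles, as in the example above
stalls = 5_000 * penalty                  # assume 5,000 misses -> 500000.0 stall cycles
cpu_time_ns = (2_000_000 + stalls) * 2    # assume 2,000,000 base CPU cycles
print(penalty, stalls, cpu_time_ns)       # 100.0 500000.0 5000000.0
```

Note how 5,000 misses (a tiny fraction of the accesses) add 25% to the execution time, which is why both the miss rate and the miss penalty matter.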
It is to be noted that the miss rate and the miss penalty are two different parameters affecting the memory stall cycles, and thereby the CPU execution time. In the case of a cache hit, the memory stall cycles are taken as zero and there is no penalty; otherwise the penalty is computed separately and included in the CPU execution time equation mentioned above.
Types of Cache Misses and Reduction Techniques
It is observed that, in spite of having large caches with adequate block sizes, and even using an optimal replacement algorithm to keep the contents of the caches always updated and relevant to the CPU's requests, cache misses cannot be avoided; they are a regular, natural, and normal event. However, for every cache miss a heavy penalty has to be paid, and hence the ultimate objective of cache organisation is to reduce the number of cache misses as much as possible. Cache misses can occur for numerous reasons. Let us now classify all possible cache misses into three distinct categories:
i. When execution just begins, an attempt to access any cache block for the first time leads to a cache miss, because the needed information cannot yet be in the cache; the required block must then be brought into the cache. This is compulsory, inevitable, and cannot be avoided. Such compulsory misses are commonly called cold start misses, or sometimes first reference misses.
ii. Cache misses also occur because the cache cannot contain, within its limited size, all the blocks referenced by the CPU during execution. This is known as a capacity miss, which may occur because blocks just thrown out of the cache to make room for other recently referenced entries are immediately demanded once again. Even the use of the most optimal (efficient) replacement algorithm and an effective cache organisation cannot always avoid this situation.
iii. Cache misses can also happen due to the block placement strategy being followed. If the mapping is set-associative or direct mapped, conflicts arise between blocks that map to the same set but have different tags. Too many such blocks compete for a set whose room is limited; a conflict miss occurs when one such block must be thrown out to make room for another block of the same class, and is then referenced again within a short interval. These are also called collision misses or interference misses.
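A minimal direct-mapped model makes the conflict pattern concrete; the cache size and reference streams below are chosen purely for illustration:

```python
# Tiny direct-mapped cache: block address mod NUM_SETS selects the line.
NUM_SETS = 4

def count_misses(block_addresses):
    lines = [None] * NUM_SETS        # one block per set
    misses = 0
    for block in block_addresses:
        idx = block % NUM_SETS       # index bits
        if lines[idx] != block:      # tag mismatch -> miss
            misses += 1
            lines[idx] = block       # evict the conflicting block
    return misses

# Blocks 0 and 4 share set 0 (0 % 4 == 4 % 4), so alternating between
# them misses every time, even though the cache is mostly empty:
print(count_misses([0, 4, 0, 4, 0, 4]))  # 6 misses (all after the first are conflicts)

# The same number of references to non-conflicting blocks:
print(count_misses([0, 1, 0, 1, 0, 1]))  # 2 misses (both compulsory)
```

The first stream suffers conflict misses despite three of the four sets being unused; higher associativity would let blocks 0 and 4 coexist.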
Numerous techniques have been devised to reduce the number of cache misses, thereby reducing the miss rate. Unfortunately, many techniques that reduce the miss rate also increase the hit time or the miss penalty. The simplest classical way to reduce the miss rate is to increase the block size. Larger block sizes reduce compulsory misses and exploit the advantage of spatial locality. On the other hand, larger block sizes reduce the number of blocks in a cache of limited size, thereby increasing collision misses, and even capacity misses if the cache is small. Collectively, this can increase the miss penalty enough to outweigh the gain obtained from the decreased miss rate, and hence a rigid implementation of this approach has fallen from favour. Higher associativity is another classical technique that improves the miss rate, but again at the expense of increased hit time, resulting in an increased miss penalty. A larger cache behaves almost like a fully associative one (minimal miss rate). Other techniques later developed to reduce the miss rate with minimal impact on the miss penalty include:
• Use of an additional small, fully associative cache between a cache and its refill path to handle primary cache misses.
• Additional hardware (an extra cache) to prefetch instructions and data. While the caches continue to supply instructions and data from their storage, the prefetched instructions are fetched in parallel and arrive at the cache simultaneously. Such a nimble cache is called a nonblocking cache or lockup-free cache.
• Use of an optimizing compiler that enables smooth prefetching by appropriately reordering the instruction code during compilation, without affecting the program semantics.
It has been observed from the outcome of many experiments, using caches of different types and sizes, that these techniques notably improve the spatial and temporal locality of the data, which in turn reduces the miss rate.
Miss Penalty and Reduction Techniques
From the very beginning of cache usage, numerous experiments with various types of caches, as well as the cache performance formula, have revealed that the miss penalty is an equally dominant factor, and that improving it can yield an even better result than simply improving the miss rate. Moreover, advances in electronic technology have continuously improved processor speed faster than that of main memory, so miss penalties are becoming more costly and are gradually increasing over time. Numerous attempts have thus been made to reduce the miss penalty, but most of the techniques improvised in this regard have an impact on the CPU. The one technique that sets the CPU aside and relieves it in this regard is the use of second-level caches.
It has also been observed that the use of a faster cache to match the CPU speed cannot, by itself, close the performance gap between processors and main memory. A larger cache can also be deployed to address this problem, by splitting it into two levels between the processor and main memory. Adding another level of cache in the hierarchy, between the original cache and memory, is straightforward. While the faster first-level cache can be small, with a clock cycle time nearly matched to that of the fast CPU, the second-level cache may be comparatively slower but large enough to catch most of the accesses that would otherwise go to memory, thereby reducing the effective miss penalty. Although the use of a second-level cache may be a useful approach, it may invite some complications when performance analysis is carried out.
The access mechanism followed by a two-level cache is as follows: each CPU request is first intercepted by the first-level cache (L1), and that cache is searched. If it is a hit, the CPU request is promptly serviced, and the time taken to service it may be called the hit time of L1 (HT_L1). If it is a miss, the request is passed to the second-level cache for appropriate action, and the miss penalty for L1 must be paid. The second-level cache is then searched. If it is a hit, the CPU request is serviced, and the time taken may be called the hit time of L2 (HT_L2). If it is a miss, the request is passed to main memory for appropriate action, and the miss penalty for L2 must then be paid. It is to be noted that the second-level miss rate is measured on the leftovers of the first-level cache. Let the miss rate and miss penalty for L1 be denoted by MR_L1 and MP_L1 respectively, and by MR_L2 and MP_L2 for L2. MR_L2 is sometimes called the local miss rate, the number of misses in this cache divided by the total number of accesses to this cache. Another term, the global miss rate, is defined as the number of misses in this cache divided by the total number of memory accesses generated by the CPU; the global miss rate of L2 is MR_L1 × MR_L2. The local miss rate of L2 tends to be large, because the first-level cache L1 skims the cream of all the memory accesses; hence the global miss rate is the more useful measure, since it identifies the fraction of memory accesses that force the CPU to go all the way to memory. Let us now estimate the average memory access time when such a two-level cache, with L1 and L2 as the first- and second-level caches respectively, is used:

Average memory access time = HT_L1 + MR_L1 × (HT_L2 + MR_L2 × MP_L2)
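The two-level estimate can be sketched numerically; the hit times, miss rates, and penalty below are hypothetical values chosen for illustration:

```python
# Average memory access time for a two-level cache:
# AMAT = HT_L1 + MR_L1 x (HT_L2 + MR_L2 x MP_L2)
def amat(ht_l1, mr_l1, ht_l2, mr_l2, mp_l2):
    return ht_l1 + mr_l1 * (ht_l2 + mr_l2 * mp_l2)

# 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit,
# 50% local L2 miss rate, 100-cycle penalty to main memory:
print(amat(1, 0.04, 10, 0.5, 100))   # ~3.4 cycles

# Global miss rate of L2 = MR_L1 x MR_L2:
print(0.04 * 0.5)                    # ~2% of all accesses go to memory
```

Even with a 50% local miss rate in L2, the average access costs only about 3.4 cycles, because only about 2% of all accesses (the global miss rate) pay the full 100-cycle trip to memory.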