Caches in Multiprocessors

The emerging dominance of microprocessors, propelled by the radical developments in VLSI technology at the beginning of the 1980s, motivated many designers to create small-scale multiprocessors in which several processors on a motherboard shared a single physical memory built around a single shared bus. By the mid-1990s, with large caches, the bus and the single memory could satisfy the memory demands of a multiprocessor built with a small number of processors. In this architecture, each processor, as shown in Figure 4.33, has uniform access to the single shared memory, and hence these machines are said to have a uniform memory access (UMA) architecture. This UMA architecture, consisting of a centralized shared memory connected to a small number of processors (CPUs and IOPs) and supported by large caches which significantly reduce the bus bandwidth requirements, is currently by far the most popular organisation, and is extremely cost-effective. Another approach consists of multiple processors with physically distributed memory. To support larger processor counts, memory must be distributed among the processors, with each processor having a part of memory of its own to meet the bandwidth demands. An illustration of such an architecture is shown in Figure 4.34.

Whatever the memory-module design, typically all processing elements (CPUs and I/O processors), main-memory modules (shared or distributed), and I/O devices are attached directly to the single shared bus, which eventually becomes a communication bottleneck, leading to contention and delay whenever two or more units try to access main memory simultaneously. This severely limits the number of processors

FIGURE 4.33

Basic structure of a centralized shared-memory multiprocessor.

FIGURE 4.34

Basic architecture of a distributed memory machine containing individual nodes (= processor + some memory + typically some I/O and interface to interconnection network).

that can be included in such a system without an unacceptable degradation in performance. To improve performance considerably, each CPU is therefore provided with a cache or a local memory connected by a dedicated local bus, which may even include local I/O devices [Lee, R. L., et al.]. Local memory supports caching of both private and shared data. All these local buses are, in turn, connected to the main shared system bus.

When a private item is cached, it simply migrates to the cache, reducing the average access time and taking much of the routine memory traffic off the main shared bus, which frees the shared bus to perform its other important functions. When shared data are cached, however, the shared value can be replicated in the caches of multiple processors. This replication helps reduce contention on the shared system bus for shared data items. But these shared data may be operated on independently by several processors, each at its own end, so that different (inconsistent) values of the same item can be generated by different processors at almost the same time. Thus, caching of shared data introduces a new and critical problem, historically called the cache coherence problem or cache inconsistency.

Cache Coherence

In a shared-bus multiprocessor system using shared memory, attaching a cache (with or without a local memory) to each processor is a necessity, both to reduce the average access time at each processor and to reduce contention for the shared system bus. Thus, each processor has its own private cache, which allows both data and instructions to be accessed by that processor without using the shared system bus. Each processor then views memory through its own individual cache. If a multiprocessor has an independent cache in each processor, two or more caches may contain different values of the same location (variable) at the same time. This is generally referred to as the cache coherence or multi-cache consistency problem. Figure 4.35 illustrates the problem, and shows how two different processors can have two different values for the same location.

In Figure 4.35, a single memory location (A) is read and written by two processors X and Y. Initially, we assume that neither cache contains the variable A and that A has the value 1 in main memory. We also assume a write-through cache; a write-back cache has similar complications, and sometimes even more. After X writes a new value to A, both X's cache and memory location A hold the new value, but Y's cache does

FIGURE 4.35

Cache coherence problem with two processors.

not. After time interval 3, different versions of the same location (A) are found in different caches. The change made by X is not known to Y, and Y holds old, stale data, violating the semantics of shared memory (when one process changes a word, subsequent reads by other processes must return the new value). This inconsistency in the value of the same location in different caches is called the cache coherence problem. Without an appropriate solution, caching cannot be used, and bus-oriented multiprocessors would then be limited to only two or three CPUs. A mechanism is thus needed that informs each cache about changes to shared information stored in any other cache, so as to ensure cache consistency.
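The sequence of events in Figure 4.35 can be mimicked in a few lines. The following is a minimal sketch, not drawn from the text: two write-through caches over one shared memory with no coherence mechanism. The initial value 1 follows the description above, while the value 0 written by X is an illustrative assumption.

```python
# A minimal sketch of the Figure 4.35 scenario: two write-through caches over
# one shared memory, with no coherence mechanism. The written value 0 is an
# illustrative assumption.

memory = {"A": 1}          # main memory, location A initially holds 1
cache_X, cache_Y = {}, {}  # private caches of processors X and Y

def read(cache, addr):
    """Read through the private cache; fill it from memory on a miss."""
    if addr not in cache:
        cache[addr] = memory[addr]   # cache miss: fetch from main memory
    return cache[addr]

def write_through(cache, addr, value):
    """Write-through policy: update this cache and main memory only."""
    cache[addr] = value
    memory[addr] = value             # other caches are NOT informed

read(cache_X, "A")                   # X caches A = 1
read(cache_Y, "A")                   # Y caches A = 1
write_through(cache_X, "A", 0)       # X writes a new value

print(cache_X["A"], memory["A"], cache_Y["A"])   # 0 0 1 -> Y holds stale data
```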

Reasons for the Coherence Problem

The entire set of information used in any program can be classified as shared versus not-shared, and also as read-only versus writable. Data that are not shared, i.e. data that are private to a process, cannot cause cache inconsistencies, because they will never appear in other caches simultaneously. Similarly, data that are read-only (like instructions) cannot cause any problems: since they are never changed, no question of different versions arises. The category that causes all the trouble is therefore shared, writable data. Whatever write policy is adopted, the problem remains the same. A write-through maintains consistency between memory and the originating cache, but the other caches in the system remain inconsistent, since they still hold the old values. With a write-back policy, main memory is not updated at the time of a change in a particular cache, so the copies in main memory and in the other caches become inconsistent. Memory will be updated only when the modified data in the cache are eventually copied back to memory.

Another source of inconsistency arises in systems having direct memory access (DMA) activity in conjunction with an IOP (I/O channel) connected to the system bus. The DMA may modify (input) locations in main memory that also reside in a cache, without updating the cache. The DMA may also read (output) a memory location whose cached copy has been modified but not yet written back, so that main memory holds a stale value. In both situations, inconsistent data will be passed. This indicates that not only the caches, but also the IOPs, must be considered when addressing the cache coherence problem [Lilja, D. J.].

Cache Coherence Problem: Solution Methodologies

In a coherent multiprocessor, the caches provide both migration and replication of shared, writable data. Coherent caches provide migration, since a data item can be moved (migrated) to a local cache and used there in a transparent fashion; this obviously reduces the latency of access to a shared data item that is allocated remotely. Coherent caches also support replication of shared data that is being read simultaneously, since each cache makes its own copy of the data item, thereby reducing both the latency of access and the contention for a read-shared data item. Supporting this migration and replication is critical to performance in accessing shared data. Various schemes have thus been proposed to solve the cache coherence problem in shared/distributed-memory multiprocessor systems.

The problem can be solved by hardware and/or software means. Hardware-only solutions have the advantage of higher speed and program transparency, but are quite expensive. Software-based solutions to ensure cache consistency require the ability to tag information (data) at compile time as either cacheable (allowed to be kept in a cache) or non-cacheable. We briefly discuss some of these schemes.

No Private Cache

Since a private cache holding shared writable data in each processor is the root of this problem, a simple scheme is to permit no private cache per processor and instead to provide a single shared cache associated with main memory. Every access by any processor then goes through the shared cache, so there can be no inconsistency in the data. But this method neither reduces the latency of access to a shared data item nor minimizes bus contention, and it also violates the principle of keeping the cache close to the CPU. Thus, rather than solving the problem, it can at best be called a scheme of avoidance.

Software Solution

A private cache for each processor in a multiprocessor system is a must; there is no real alternative from a performance point of view. As noted, the category that causes all the trouble is shared, writable data. One approach is then to allow only non-shared and read-only data to be stored in caches. Such items are called cacheable, while the remaining shared writable data are operated on only in main memory, and no cache may migrate or replicate these items into its private region. These items are called non-cacheable. When a program is compiled, the compiler must tag data as either cacheable or non-cacheable, and the system hardware must ensure that only cacheable data are stored in caches. Cache coherence can then be implemented by a write-through policy that requires a processor to mark a shared cache item K as invalid, or to deallocate it, whenever the processor writes into K. When the processor references K again, it is forced to access main memory, thereby always getting the most recent version of K. Invalidation, however, is not an ideal approach, because it may sometimes force the removal of needed data from the cache, reducing the hit ratio and thereby increasing main-memory traffic. Moreover, this method imposes restrictions on the type of data that may be stored in caches, and introduces extra software overhead, leading to some degradation in system performance.

A slight variation of this approach is a scheme that permits writable data to exist in at most one cache. This method generates a centralized global table at compile time. The status of all memory blocks is stored in this central global table, with each block identified as either Read-Only (RO) or Read-Write (RW). All caches may hold copies of blocks identified as RO, but only one cache may hold a copy of an RW block. Thus, if the data in an RW block are updated in that cache, the other caches are not affected at all, since they do not have a copy of this block. Inconsistency of data across different caches then no longer exists.
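A hedged sketch of this global-table variant is given below: every block is tagged RO or RW at compile time, and the run-time rule assumed here is that RO blocks may be cached anywhere while an RW block may live in at most one cache. The block names and the ownership bookkeeping are illustrative assumptions.

```python
# A sketch of the centralized global-table scheme: RO blocks may be replicated,
# RW blocks are restricted to a single cache. Names and rules are illustrative.

global_table = {"B0": "RO", "B1": "RW"}   # block -> access type (from the compiler)
holders = {}                              # RW block -> the single cache holding it

def may_cache(block, cache_id):
    """Return True if cache_id is allowed to keep a copy of this block."""
    if global_table[block] == "RO":
        return True                       # read-only blocks: any number of copies
    owner = holders.setdefault(block, cache_id)
    return owner == cache_id              # read-write blocks: one cache only

print(may_cache("B0", "X"), may_cache("B0", "Y"))   # True True
print(may_cache("B1", "X"), may_cache("B1", "Y"))   # True False
```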

Another scheme that provides a solution is based on collecting all the shared, writable data together and putting them in a separate segment, page, or part of the address space. The compiler takes this responsibility and places all shared writable data in a pre-defined area of virtual memory. During run time, caching is turned off for addresses lying in this pre-defined area: all references to it are serviced directly from memory, and all writes go straight back to memory. Such data are never cached, no matter how often they are used. References outside this area are cached as usual. Implementation of this software solution demands that the cache hardware be able to disable caching for a range of addresses, and hence requires some additional hardware assistance.
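A hedged sketch of this last scheme follows: the compiler is assumed to have placed all shared, writable data in a reserved address range, and every reference to that range bypasses the cache. The range boundaries, word granularity, and write-through handling of cacheable data are assumptions made only for illustration.

```python
# Sketch of the "separate non-cacheable segment" idea: caching is disabled for
# one reserved address range that holds all shared, writable data.

NOCACHE_START, NOCACHE_END = 0x8000, 0x9000   # assumed shared-writable area

memory = {addr: 0 for addr in range(0x0000, 0xA000, 4)}   # word-addressed memory
cache = {}                                    # one processor's private cache

def load(addr):
    if NOCACHE_START <= addr < NOCACHE_END:
        return memory[addr]                   # shared, writable: never cached
    if addr not in cache:
        cache[addr] = memory[addr]            # normal cacheable reference
    return cache[addr]

def store(addr, value):
    if NOCACHE_START <= addr < NOCACHE_END:
        memory[addr] = value                  # goes straight to main memory
    else:
        cache[addr] = value
        memory[addr] = value                  # write-through for cached data

store(0x8004, 42)      # shared datum: immediately visible to every processor
store(0x0010, 7)       # private datum: cached as usual
print(load(0x8004), load(0x0010))   # 42 7
```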

Hardware-Only Solution

While the software approach offers a workable solution to the cache coherence problem, it does so at the cost of a notable degradation in performance, mainly due to:

  • disabling caching for shared, writable data,
  • increase in main-memory traffic,
  • reduction in hit ratio, and other similar reasons.

For applications where heavy use of shared writable data is inevitable, as in some database systems, this solution is not only inadequate but fails to provide even a reasonable return. We hence need to look at a comparatively fast hardware solution with program transparency, by introducing a protocol that maintains coherent caches (i.e. removes inconsistency). The set of rules implemented by the CPUs, caches, and main memory for preventing different versions of the same memory block from appearing in multiple caches is called a cache consistency protocol. These protocols also maintain coherence for multiple processors [Archibald, J., et al.]. The key to implementing such a protocol is to track the state of any sharing of a physical-memory data block. There are two classes of protocols in use, which exploit different techniques to track the sharing status:

  • Directory based: The sharing status of a block of physical memory is kept in just one location, called the directory. Distributed shared-memory (DSM) architectures (NUMA) use this technique [Agarwal, A., et al.].
  • Snooping: Every cache that has a copy of a data item from a block of physical memory also has a copy of the sharing status of that block, and no centralized state such as a directory is kept. The cache controllers are specially designed to monitor, or snoop on, the bus to determine whether or not they have a copy of a block that is requested on the bus by CPUs and IOPs. These caches are called snooping caches or sometimes snoopy caches.

A simple method based on a snoopy cache protocol is to adopt a write-through policy using the following procedure. All the snoopy controllers watch the bus for memory write operations. When a word in any cache is modified by writing into it, the corresponding memory location is updated by virtue of write-through. All the snoopy controllers in the other caches then check their own entries to determine whether they hold a copy of the item just modified. If a copy exists in any remote cache, that location is marked invalid. Since all caches snoop on every bus write, whenever an item is modified in any cache, main memory is updated accordingly and the item is removed (invalidated) from all other caches. At a later time, when a processor accesses the invalidated item in its own cache, the result is equivalent to a cache miss: the corresponding main-memory location is consulted, and the updated item is transferred from main memory to the cache in question. Hence, no inconsistent versions are present in any cache, and the coherence problem is thereby avoided [Vernon, M. K., et al.].
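The following small simulation sketches this write-through snoopy scheme; the class structure and names are assumptions for illustration, not an actual controller design. Each cache controller "snoops" every bus write and invalidates its own copy of the written location.

```python
# Sketch of write-through snooping with invalidation: every controller watches
# bus writes and drops its own copy of the written location.

class SnoopyCache:
    def __init__(self, bus):
        self.lines = {}            # address -> value
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:               # miss (or invalidated earlier)
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.broadcast_write(self, addr, value)   # write-through on the bus

    def snoop_write(self, addr):
        self.lines.pop(addr, None)               # invalidate local copy if present

class Bus:
    def __init__(self):
        self.memory = {"A": 1}
        self.caches = []

    def broadcast_write(self, origin, addr, value):
        self.memory[addr] = value                # memory updated (write-through)
        for cache in self.caches:
            if cache is not origin:
                cache.snoop_write(addr)          # all other copies invalidated

bus = Bus()
cx, cy = SnoopyCache(bus), SnoopyCache(bus)
cx.read("A"); cy.read("A")     # both caches hold A
cx.write("A", 0)               # Y's copy is invalidated by snooping
print(cy.read("A"))            # miss -> fetches the up-to-date value 0
```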

Snooping protocols are generally used in systems with caches attached to a single shared memory. These protocols can use a pre-existing physical connection, the bus to memory, to interrogate the status of the caches. This makes snooping the most popular approach in small-scale multiprocessors built with microprocessors and a single shared memory.

Many variations of these protocols have been proposed, explored, implemented, and analysed. We will briefly discuss here the two basic approaches to the snoopy protocol. For the sake of simplicity, we intentionally omit the mechanisms involving multi-level (L1 and L2) caches, and we do not consider systems built on distributed multiprocessors; in any case, these would not add any new principles.

4.7.15.3.1 Write-Invalidate Protocol: MESI Protocol

The write-invalidate approach is probably the most widely used one in commercial multiprocessor systems built with microprocessors, such as the Pentium 4 and PowerPC. With this protocol there can be multiple readers, but only one writer at any point of time. As usual, a line may be shared among several caches for reading purposes. But when one of the caches wants to perform a write to the line, it first issues a notice that invalidates that line in the other caches, making the line exclusive to the writing cache. Once the line becomes exclusive, the owning processor can execute subsequent local writes cheaply until some other processor requires the same line.

This protocol marks the state of every cache line (using two extra bits in the cache tag) as Modified, Exclusive, Shared, or Invalid; that is why it is often called the MESI protocol. The method ensures that a processor has exclusive access to a data item before it writes that item. This style of protocol is called a write-invalidate protocol because it invalidates other copies on a write. It is by far the most common protocol, both for snooping and for directory schemes. Exclusive access ensures that no other readable or writable copies of an item exist when the write occurs; all other cached copies of the item are invalidated. Figure 4.36 shows an example that ensures coherence by adopting an invalidation protocol.

Two processors X and Y with write-back caches are attached to a snooping bus, and we follow a single cache block A. Since a write requires exclusive access (here by processor X), any copy

FIGURE 4.36

Invalidation protocol using snooping bus with write-back caches.

held by the reading processor (here, processor Y) must be invalidated. Thus, when the subsequent read by processor Y occurs, it misses in the cache and Y is forced to fetch a new copy of the data. At this moment, CPU X responds with the value from its cache, cancelling the response from memory to CPU Y's request, and both the contents of Y's cache and the memory contents of A are updated (write-back). This is illustrated in line 4 of Figure 4.36. The writing processor thus has exclusive access, preventing any other processor from writing simultaneously. If two processors do compete to write the same data simultaneously, one of them wins the race, causing the other processor's cached copy to be invalidated. If that other processor then wants to complete its write, it must obtain a new copy of the data with the updated value. In other words, this protocol enforces write serialization. This scheme has been used in several IBM 360/370 series computers, especially in the larger model IBM 3033.
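The common-case MESI transitions described above can be summarised in a small table. The sketch below is a simplification (miss handling, write-backs, and the Exclusive-versus-Shared choice on a read miss are only noted in comments), intended just to show how the four states respond to local and snooped events.

```python
# A compact, simplified MESI transition table (a sketch of the common-case
# transitions, not a full implementation). States: M, E, S, I.

MESI = {
    # (current state, event) -> next state
    ("M", "local_read"):       "M",   # hit, no bus traffic
    ("M", "local_write"):      "M",   # hit, line already owned
    ("M", "snoop_read"):       "S",   # supply data / write back, then share
    ("M", "snoop_invalidate"): "I",
    ("E", "local_read"):       "E",
    ("E", "local_write"):      "M",   # silent upgrade, no bus transaction needed
    ("E", "snoop_read"):       "S",
    ("E", "snoop_invalidate"): "I",
    ("S", "local_read"):       "S",
    ("S", "local_write"):      "M",   # must broadcast an invalidate first
    ("S", "snoop_read"):       "S",
    ("S", "snoop_invalidate"): "I",
    ("I", "local_read"):       "S",   # miss: E if no other sharer, S assumed here
    ("I", "local_write"):      "M",   # miss: read-for-ownership invalidates others
}

def next_state(state, event):
    return MESI.get((state, event), state)   # snoops on an Invalid line change nothing

# Example: a line shared by two caches, then written by one of them.
print(next_state("S", "local_write"))        # 'M'  (writer gains exclusive ownership)
print(next_state("S", "snoop_invalidate"))   # 'I'  (the other copy is invalidated)
```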

4.7.15.3.2 Write-Broadcast Protocol: Write-Update

The alternative to a write-invalidate protocol is a write-update protocol. With this protocol there can be multiple readers as well as multiple writers. When a processor wishes to update a shared line in its local cache, the word to be updated is broadcast to all the others, and every cache attached to another processor that contains that line then updates it.

With this protocol, all cached copies of a data item are updated as soon as the item is written (modified). The processor executing the write in its own cache broadcasts the new value to all caches in the system (and to main memory) via the shared bus. Every cache then examines its own directory to see whether the broadcast item is present; if it is, the item is updated in that cache, otherwise the broadcast is simply ignored. In this way, the consistency of shared data in all caches is maintained. To keep the bandwidth requirements of this protocol under control, it is useful to track whether or not the item to be broadcast is actually shared, that is, contained in other caches. Another disadvantage of this technique is that every cache write operation, which is very frequent, forces all other caches to check the broadcast data, making those caches unavailable for their own normal processing. In the decades since these protocols were developed, the write-invalidate protocol has ultimately emerged as the winner for the vast majority of designs. Figure 4.37 shows an example of a write-broadcast protocol in operation.
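For contrast with the invalidation sketch shown earlier, the following is a hedged sketch of the write-update alternative: the writer broadcasts the new value, every cache that holds the line updates it in place, and caches that do not hold the line ignore the broadcast. Names and structure are again illustrative assumptions.

```python
# Sketch of write-update (write-broadcast): remote copies are refreshed in
# place instead of being invalidated.

class UpdateCache:
    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.broadcast_update(self, addr, value)

    def snoop_update(self, addr, value):
        if addr in self.lines:          # update only if this cache holds the line
            self.lines[addr] = value    # otherwise the broadcast is ignored

class Bus:
    def __init__(self):
        self.memory = {"A": 1}
        self.caches = []

    def broadcast_update(self, origin, addr, value):
        self.memory[addr] = value
        for cache in self.caches:
            if cache is not origin:
                cache.snoop_update(addr, value)

bus = Bus()
cx, cy = UpdateCache(bus), UpdateCache(bus)
cx.read("A"); cy.read("A")
cx.write("A", 0)
print(cy.lines["A"])    # 0 -- Y's copy was updated, not invalidated
```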

Neither of these two approaches, however, is superior to the other under all circumstances. Performance depends greatly on the number of local caches and the pattern of memory reads and writes. In a bus-based multiprocessor, since bus and memory bandwidth are usually the most in-demand commodities, invalidation has become the

FIGURE 4.37

Broadcast protocol using snooping bus with write-back caches.

protocol of choice for most implementations. Since current trends are towards increasing processor performance and a related increase in bandwidth demand, update schemes (broadcast protocols) are expected to be used only infrequently, and the invalidate scheme naturally has the edge over the broadcast protocol. Nevertheless, some systems implement adaptive protocols that employ both write-invalidate and write-update mechanisms.

Two-Level Memory Performance: Cost Consideration

Let us now look at some of the parameters that are relevant for assessing the cost to be incurred in a two-level memory organisation. The two-level memory module may be: (i) cache memory-main memory, or (ii) main memory-virtual memory. Whatever the memory module, to estimate the average cost of accessing an item we must consider not only the sizes of the two levels of memory M1 and M2, but also the cost per bit of each individual memory system, so as to find the average cost of the two-level memory module [Przybylski, S. A.]. We assume,

Ts = Average access time (system)

T1 = Access time of M1 (e.g., cache, disk cache, etc.)

T2 = Access time of M2 (e.g., main memory, virtual memory (disk), etc.)

H = Hit ratio (fraction of time a reference is found in M1, or the probability of finding a given reference in M1)

Cs = Average cost per bit for the combined two-level memory

C1 = Average cost per bit of upper-level memory M1

C2 = Average cost per bit of lower-level memory M2

S1 = Size of memory M1

S2 = Size of memory M2

Then

Cs = (C1 × S1 + C2 × S2) / (S1 + S2)

Our target is to attain Cs ≈ C2. Given that C1 >> C2, this requires S1 << S2.
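A short worked example of this cost relation is given below; the sizes and per-bit costs are illustrative assumptions, not figures from the text.

```python
# Worked example of the combined cost-per-bit formula. Sizes and costs are
# assumed purely for illustration.

C1, S1 = 0.01, 4 * 2**20          # upper level: costlier, small (e.g., 4 MB cache)
C2, S2 = 0.0001, 4 * 2**30        # lower level: cheaper, large (e.g., 4 GB memory)

Cs = (C1 * S1 + C2 * S2) / (S1 + S2)
print(round(Cs / C2, 3))          # ~1.097: Cs stays close to C2 because S1 << S2
```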

From the perspective of access time, the use of a two-level memory should yield a significant performance improvement; we thus need Ts to be approximately equal to T1 (Ts ≈ T1). Given that T1 is much less than T2 (T1 << T2), a hit ratio close to 1 is therefore demanded.

Naturally, one would like M1 to be as small as possible to hold down cost, and at the same time large enough to improve the hit ratio and thereby enhance performance. The objective is then to find a size of M1 that satisfies both requirements to a reasonable extent. This, in turn, demands answers to a number of questions, a few of which are:

i. What value of hit ratio is needed to satisfy a desired level of performance?

ii. What size of M1 will ensure the required hit ratio?

iii. Will this size of M1 fit within the related cost to be incurred?

To answer these, let us define the quantity T1/Ts, which is referred to as the access efficiency. It is a measure of how close the average access time (Ts) is to the M1 access time (T1). From Section 4.7.4, we have

Ts = H × T1 + (1 - H) × (T1 + T2) = T1 + (1 - H) × T2

and hence,

T1/Ts = 1 / [1 + (1 - H) × (T2/T1)]

This shows that access efficiency is a function of the hit ratio H, with the quantity T2/T1 as a parameter. For cache memory-main memory, cache access time is about five to ten times faster than main-memory access time (i.e., T2/T1 is 5-10), and in the case of main memory-virtual memory, main-memory access time is about 1,000 times faster than disk access time (i.e., T2/T1 is about 1,000). Thus, a hit ratio in the range of 0.8-0.9 is expected to be required to attain the desired level of performance.
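The following quick computation illustrates the access-efficiency relation for an assumed cache/main-memory pair with T2/T1 = 10; the hit-ratio values are chosen only for illustration.

```python
# Numerical check of the access-efficiency relation, assuming T2/T1 = 10.

def access_efficiency(H, t2_over_t1):
    # T1/Ts = 1 / (1 + (1 - H) * T2/T1), from Ts = T1 + (1 - H) * T2
    return 1.0 / (1.0 + (1.0 - H) * t2_over_t1)

for H in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(H, round(access_efficiency(H, 10), 3))
# -> 0.5 0.167, 0.8 0.333, 0.9 0.5, 0.95 0.667, 0.99 0.909
```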

The ultimate aim is thus to attain the desired hit ratio so that the performance goal can be met. The hit ratio depends on a number of parameters, including the design of the two-level memory (and of the cache), the replacement algorithm, the nature of the software being executed, and many more. All these parameters are selected so that a high degree of locality can be realized. The degree to which locality can be exploited is to a large extent determined by the relative memory size (S1/S2), as indicated in Figure 4.38. If S1/S2 = 1, i.e. if M1 is the same size as M2, then the hit ratio will be 1.0: all of the items in M2 are always stored in M1 as well. Now suppose that there is no locality, i.e. all references are completely random; then the hit ratio is a linear function of the relative memory size S1/S2, as shown in Figure 4.38. If the locality is appreciable, a high hit ratio can be achieved even with small values of the relative memory size S1/S2, i.e. with a small upper-level memory S1. This is indicated by the lines "moderate locality" and the "strong

FIGURE 4.38

Hit ratio as a function of relative memory size.

locality" in Figure 4.38, where a comparatively lower value of relative memory size (Si/S2) can yield a higher hit ratio.

Numerous studies have revealed that rather small cache sizes can offer a hit ratio above 0.8, regardless of the size of the main memory. Usually, a cache size of 128-256K or even 512 words on-chip, and of multiple megabytes off-chip, is generally found adequate, whereas main memory is now typically multiple gigabytes. For virtual memory and disk caches, this observation still holds. Such a relative sizing of the two memories not only satisfies users' requirements from the performance point of view, but also satisfies designers' targets with respect to cost. If a relatively small upper-level memory can achieve good performance, then the cost per bit of the two-level memory will naturally tend to approach that of the cheaper lower-level memory.

Memory Hierarchy Design: Size and Cost Consideration

Let us assume the design of a three-level memory hierarchy with current standard specifications for the memory characteristics. The target is to attain an effective memory-access time of roughly t = 10.00 μs as the design goal, with a cache hit ratio of about h1 = 0.98 and a hit ratio in main memory of h2 = 0.9. The total cost of the memory hierarchy is also budgeted with some upper bound. It can be shown with suitable computations that increasing the main-memory size, with a corresponding decrease in disk capacity so as to stay within the budgeted cost, will not affect the cache hit ratio at all. Moreover, the effective memory-access time can be reduced further without disturbing the budget if the main-memory and virtual-memory management issues (such as page replacement and page size) are properly addressed and resolved with care.
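As a rough illustration (not the book's own table, which is on the companion website), the sketch below shows how such an effective access time can be estimated from the quoted hit ratios and a set of assumed device access times.

```python
# Sketch of the effective access time of a three-level hierarchy
# (cache -> main memory -> disk). The hit ratios are those quoted in the text;
# the device access times are assumptions chosen for illustration. Each t_i is
# taken as the total time to service a reference that is found at level i.

h1, h2 = 0.98, 0.9                 # cache hit ratio, main-memory hit ratio
t1, t2, t3 = 25e-9, 250e-9, 5e-3   # assumed access times: 25 ns, 250 ns, 5 ms

t_eff = h1 * t1 + (1 - h1) * h2 * t2 + (1 - h1) * (1 - h2) * t3
print(round(t_eff * 1e6, 2), "microseconds")   # ~10.03, close to the 10 us goal
```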

A brief treatment of this topic, supported by a table with rough estimates, is given on the website: http://routledge.com/9780367255732.

 