Distributed Shared Memory Multiprocessors (DSM): NUMA Model

The two popular approaches that eventually evolved into numerous commercial products providing multiple-processor support to applications are SMPs and Clusters (to be discussed in Section 10.9). Another architectural approach that has drawn considerable interest in this area is the non-uniform memory access (NUMA) architecture.

NUMA is a comparatively more attractive form of shared-memory multiprocessor in which the shared memory is physically distributed, attached directly as local memory to each processor, so that every processor can sustain a high computation rate owing to faster access to its local memory. A memory unit local to one processor can also be accessed globally by the other processors, with an access time that varies with the location of the memory word. In this way, the collection of all local memories attached to the individual processors forms a global address space shared by all processors. NUMA machines are thus legitimately called distributed shared-memory (DSM) or scalable shared-memory architectures. A slightly different implementation of the NUMA multiprocessor is further supplemented by a physically remote common memory, in addition to the usual distributed memory that is local to one processor but global to the others. This scheme forms a memory hierarchy: each processor has the fastest access to its own local memory; next comes its access to the global memories that are individually local to other processors; slowest is the access to the common remote memory. Figure 10.17 exhibits a representative architectural scheme of such a NUMA system, in which the large rectangle encloses one node of the system. A few of the many commercial products introduced on the basis of NUMA architecture are the BBN TC-2000, which uses 512 Motorola 88100 RISC processors and a butterfly network for interconnection; the Silicon Graphics Origin NUMA system, designed to support up to 1024 MIPS R10000 processors; and the Sequent NUMA-Q, built with up to 252 Pentium II processors.
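As an illustration of how this hierarchy of local and remote memories becomes visible to software, the following minimal C sketch uses the Linux libnuma library (an assumed environment, not discussed in the text) to allocate one buffer on the caller's own node and one on another node; accesses to the second buffer must cross the interconnection network.

    /* Minimal sketch: local versus remote allocation on a NUMA machine.
       Assumes Linux with libnuma installed; build with: gcc numa_demo.c -lnuma */
    #include <stdio.h>
    #include <stdlib.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA API not supported on this system\n");
            return EXIT_FAILURE;
        }

        size_t size = 1 << 20;           /* 1 MB buffer                     */
        int last = numa_max_node();      /* highest node number present     */

        /* Fastest case: memory taken from the node the caller runs on.     */
        char *local = numa_alloc_local(size);

        /* Slower case: memory deliberately placed on another node, so every
           access from here travels across the interconnection network.     */
        char *remote = numa_alloc_onnode(size, last);

        if (local)  { local[0]  = 1; numa_free(local,  size); }
        if (remote) { remote[0] = 1; numa_free(remote, size); }

        printf("nodes available: %d\n", last + 1);
        return EXIT_SUCCESS;
    }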

Similar to SMP, the architecture used in the NUMA model must ensure coherence among the caches attached to the CPUs of a node, as well as coherence among the non-local caches of other nodes. This requirement, as usual, consumes part of the bandwidth of the interconnection network and may eventually slow down memory accesses (Stenstrom, P., et al.).

The nodes in the NUMA (DSM) architecture are typically high-performance SMPs, each containing around four or eight CPUs, and all the SMPs are connected by a high-speed global interconnection network. Owing to this non-local communication network, the NUMA architecture becomes fairly scalable: more nodes can be added, whenever needed, to obtain improved performance (Grindlay, R., et al.). The actual performance of a NUMA system, however, depends mostly on the non-local memory accesses made by processes as they traverse the memory hierarchy during execution. This issue lies within the domain of the OS and will be addressed in the next section.
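Because performance hinges on keeping memory accesses local, system software often pins a thread to one node and prefers node-local allocations. The sketch below (again assuming Linux with libnuma; the node number chosen is purely hypothetical) shows the two calls that establish such an affinity.

    /* Illustrative only: confine the calling thread to one NUMA node and
       prefer local allocations, so most memory traffic stays on that node.
       Assumes Linux + libnuma; build with: gcc pin_node.c -lnuma */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;                  /* no NUMA support on this machine          */

        int node = 0;                  /* hypothetical target node                 */
        numa_run_on_node(node);        /* run only on the CPUs of this node        */
        numa_set_localalloc();         /* satisfy allocations from the local node  */

        /* ... work done here now touches mostly node-local memory,
           avoiding the slower remote path ... */
        printf("thread confined to node %d\n", node);
        return 0;
    }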

FIGURE 10.17

A scheme of NUMA model (Distributed shared memory architecture).

Cache-Coherent NUMA: CC-NUMA Model

The wide acceptance of NUMA architecture-based machines in the commercial market encouraged designers to redefine and reconstruct the existing NUMA architecture so as to free it from most of its inherent drawbacks. One of the critical shortcomings of the traditional NUMA model is the cache coherence problem between the caches attached to the CPUs within a node, as well as inconsistency among the non-local caches (caches attached to other nodes). The popular commercial products subsequently launched without a cache-coherence problem are the CC-NUMA systems (Stenstrom, P., et al.), in which cache consistency is rigorously maintained among the various caches attached to their respective processors (Scheurich, C., et al.). Although CC-NUMA systems are quite distinct from both SMPs and Clusters (to be discussed in Section 10.9), they are still considered more or less comparable to Clusters, and they are commonly referred to in the commercial literature as CC-NUMA systems.

As usual, the basic building block of the overall CC-NUMA organisation is a set of multiple independent nodes, each of which is essentially an SMP organisation. Figure 10.18 depicts such a representative CC-NUMA organisation, in which each node contains multiple processors, each with its own L1 and L2 caches, plus main memory (Lee, R. L., et al.). Although the processors of each node are connected to a private local main memory of their own, from the point of view of all the processors in the entire system there is only a single, transparently addressable system-wide memory in which each location has a unique system-wide address. In fact, when a processor initiates a memory access and the requested memory word is not in the processor's cache (considering both L1 and L2), the L2 cache initiates a fetch action across the local bus to get the desired

FIGURE 10.18

A scheme of a CC-NUMA organisation.

word from the local portion of the main memory. If the desired word is still not found, the global portion of the main memory is consulted: a request is automatically issued across the interconnection network to fetch the word, which is then delivered to the local bus of the requesting processor and finally to the requesting cache on that bus. The whole procedure is carried out transparently and automatically (Scheurich, C., et al.).
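The access path just described can be summarised by a toy model, given below purely for illustration (the address checks and the division between local and remote ranges are invented, not taken from any real machine): the caches are consulted first, then the node-local portion of main memory, and only then does the request cross the interconnection network.

    /* Toy model of the CC-NUMA access path: L1 -> L2 -> local memory -> remote node.
       All address checks below are invented for illustration only.                 */
    #include <stdbool.h>
    #include <stdio.h>

    enum level { HIT_L1, HIT_L2, LOCAL_MEMORY, REMOTE_MEMORY };

    /* Hypothetical helpers standing in for the real hardware lookups. */
    static bool in_l1(unsigned long addr)         { return (addr & 0xFF) == 0; }
    static bool in_l2(unsigned long addr)         { return (addr & 0x0F) == 0; }
    static bool on_local_node(unsigned long addr) { return addr < 0x40000000UL; }

    static enum level resolve(unsigned long addr)
    {
        if (in_l1(addr))         return HIT_L1;        /* a few cycles            */
        if (in_l2(addr))         return HIT_L2;        /* tens of cycles          */
        if (on_local_node(addr)) return LOCAL_MEMORY;  /* local bus + local DRAM  */
        return REMOTE_MEMORY;   /* request crosses the interconnection network    */
    }

    int main(void)
    {
        const char *name[] = { "L1", "L2", "local memory", "remote memory" };
        unsigned long probes[] = { 0x100, 0x1F0, 0x123, 0x50000001UL };
        for (int i = 0; i < 4; i++)
            printf("addr %#lx -> %s\n", probes[i], name[resolve(probes[i])]);
        return 0;
    }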

In order to implement such system-wide transparency in the memory system, as well as to mitigate the critical cache coherence problem, each node in general must maintain some sort of directory (see cache coherence solution, Chapter 4) that keeps track of the status of all those blocks of physical memory that are being shared, together with the cache status information (Lilja, D. J.); a minimal sketch of such a directory entry is given after the list below. The implementation of this directory, or of some other suitable data structure, may differ in its details, which to some extent are influenced by the type of interconnection network employed. Apart from the hardware mechanisms that transparently implement a single system-wide memory (although the memory is actually distributed over the nodes), some additional mechanism is still needed, in the form of a cache coherence protocol, to deal with the inherent cache inconsistency problem (Lee, R. L., et al.). Different systems, of course, handle this problem in different ways (see Chapter 4; cache coherence protocol), and their implementations also differ. The CC-NUMA model, however, exhibits some distinct advantages. A few notable ones are (Lovett, T., et al.):

  • This system can normally provide effective performance superior to an SMP, offering higher levels of parallelism without requiring major software upgrades.
  • The presence of multiple NUMA nodes keeps the bus traffic on each individual node within a limit, catering to a demand that the bus can sustain and handle.
  • The L1 and L2 caches are designed to minimize all types of memory accesses, thereby preventing performance from degrading even if many of the memory requests are targeted at remote nodes.
  • High-quality performance of the L1 and L2 caches can be ensured by the use of appropriate software with good spatial locality. Software having this quality makes it almost certain that, for an application, a limited number of frequently used pages, which can be initially loaded into the memory local to the running application, will contain most of the information needed at run-time.
  • By including a page migration mechanism in the operating system's handling of virtual memory, a virtual page can be moved to the specific node that uses it most frequently. Silicon Graphics designers have found significant success with this approach (Whitney 97); an illustrative sketch of such page movement on Linux appears after the discussion of drawbacks below.
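To make the directory idea mentioned before this list a little more concrete, the following sketch shows one plausible shape of a per-block directory entry; the names, field widths, and states are illustrative only and are not taken from any particular machine.

    /* Minimal sketch of a per-block coherence directory entry: the home node
       records the block's state and which nodes currently cache it.           */
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_NODES 64   /* illustrative upper bound on nodes                */

    enum dir_state {
        DIR_UNCACHED,      /* no cache holds the block                         */
        DIR_SHARED,        /* one or more nodes hold read-only copies          */
        DIR_EXCLUSIVE      /* exactly one node holds a modifiable copy         */
    };

    struct dir_entry {
        enum dir_state state;
        uint64_t       sharers;   /* bit i set => node i has a copy            */
        uint8_t        owner;     /* meaningful only in the EXCLUSIVE state    */
    };

    /* On a write miss the home node would invalidate the recorded sharers
       and then hand the block over to the writing node, as modelled here.     */
    static void record_writer(struct dir_entry *e, uint8_t node)
    {
        e->state   = DIR_EXCLUSIVE;
        e->sharers = (uint64_t)1 << node;
        e->owner   = node;
    }

    int main(void)
    {
        struct dir_entry e = { DIR_UNCACHED, 0, 0 };
        record_writer(&e, 3);
        printf("state=%d sharers=%#llx owner=%u\n",
               e.state, (unsigned long long)e.sharers, e.owner);
        return 0;
    }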

Still, CC-NUMA has been observed to suffer from several drawbacks. Apart from the performance degradation that may occur due to frequent remote memory accesses, other disadvantages have also been noted. First of all, a CC-NUMA does not look as transparent as an SMP: software changes are required when an operating system, as well as its applications, is moved from an SMP to a CC-NUMA. These changes concern page allocation, process allocation, load balancing, and other similar functions normally carried out by the operating system. The second drawback is essentially that of availability. This is rather a complex issue and depends mostly on the exact implementation of the CC-NUMA system within the scope of present hardware improvements (Stenstrom, P., et al.).
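As an aside on the page-placement adjustments mentioned above (and the page migration noted in the last advantage of the previous list), the sketch below shows how, on Linux, a process can ask the operating system to move one of its pages to a chosen node using the move_pages(2) system call. The environment and the destination node are assumptions for illustration, not details from the text.

    /* Illustrative page migration on Linux: move one page of this process to
       a chosen NUMA node. Assumes libnuma; build with: gcc migrate.c -lnuma   */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <numaif.h>          /* move_pages(), MPOL_MF_MOVE                 */

    int main(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        void *page;
        if (posix_memalign(&page, pagesize, pagesize) != 0)
            return 1;
        ((char *)page)[0] = 1;   /* touch the page so it is actually mapped    */

        void *pages[1]  = { page };
        int   nodes[1]  = { 0 }; /* hypothetical destination node              */
        int   status[1] = { -1 };

        /* pid 0 means the calling process; MPOL_MF_MOVE migrates the page.    */
        if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) < 0)
            perror("move_pages");
        else
            printf("page now resides on node %d\n", status[0]);

        free(page);
        return 0;
    }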

No Remote Memory Access (NORMA)

Multiprocessors having a global memory system allow any processor to access any memory module without interference from another processor. A different way of organizing

FIGURE 10.19

A distributed memory system multiprocessor.

this system is also possible, as shown in Figure 10.19. All memory modules here serve as private memories for the processors directly connected to them (in contrast to a single system-wide memory formed from all the memories attached to the various processors or distributed over the nodes). No processor can access a remote memory module without the cooperation of the processor that owns it. This cooperation normally takes place in the form of messages exchanged by the processors. Such systems are often called NORMA or distributed-memory systems with a message-passing protocol.
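The message-passing cooperation described above can be illustrated with a short MPI program (MPI is an assumed environment here, not one prescribed by the text): each process owns its memory privately, and data held by one process reaches another only through an explicit send/receive pair.

    /* Minimal NORMA-style exchange with MPI: rank 0's value reaches rank 1
       only by an explicit message, never by a direct memory access.
       Build and run (example): mpicc norma.c -o norma && mpirun -np 2 ./norma */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;          /* lives in rank 0's private memory           */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* rank 1 cannot read rank 0's memory; it must receive a message   */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }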

General-Purpose Multiprocessors

A generalized multiprocessor system can be envisaged by combining the salient features of the UMA, NUMA, and CC-NUMA (also COMA) models already discussed, which mainly use processors and memory modules as the major functional units. But any multiprocessor must also provide extensive I/O capability to manage the growing user density in its working environment, and hence I/O must not be set aside in the design of a generalized multiprocessor organisation. Above all, the system must include efficient interconnection networks for fast communication among the multiple processors and the shared memory, I/O, and peripheral devices. A high-level view of such a possible multiprocessor organisation is depicted in Figure 10.20.

Each processor Pi is provided with its own local memory and private cache. Multiple processors are connected to shared-memory modules through an interprocessor-memory interconnection network. The processors are themselves directly interconnected with one

FIGURE 10.20

Generalized multiprocessor system with interconnection structures using local memory, cache memory, shared memory and shared peripheral devices.

another through an interprocessor communication network, rather than through the shared memory. The processors gain access to shared I/O and peripheral devices through a separate processor-I/O interconnection network. The performance and cost of these machines depend critically on the implementation details, especially on the efficiency and capability of the various types of networks used for the different purposes of communication, for which numerous alternatives exist.

Implementation: A Mainframe SMP (IBM z990 Series)

An SMP realized with a single shared bus is probably one of the most common arrangements for PCs and workstations. With this arrangement, however, the single bus becomes a bottleneck as the traffic load grows, and it also critically affects the scalability of the design. An alternative approach has been used in a recent implementation introduced by IBM in its z-series mainframe family, the z990 series (Siegel, 04; Mak, 04). This family of systems is highly scalable, spanning a range from a uniprocessor with one main memory card to a high-end system consisting of 48 processors and 8 memory cards. The system is built from a particular type of unit called a book: a pluggable unit that can contain up to 12 processors with up to 64 GB of memory, I/O adapters, and a 32 MB L2 cache in a system control element (SCE), which connects the other elements. The z990 system comprises one to a maximum of four such books. Each book contains two self-controlled memory cards (for a total of eight such cards across the maximum four-book configuration), and each card can handle its own memory accesses at relatively high speed. Each processor chip is a dual-core device consisting of two identical CISC superscalar microprocessors used as central processors (CPs). Each processor is connected to a single memory card (actually to the L2 cache) by point-to-point links (switched interconnections), and each L2 cache, in turn, has a link to each of the two memory cards on the same book via the memory store control (MSC). Instead of a bus, point-to-point links (a switched network) are also employed here to connect the I/O channels. Many other additional features have also been included in the design, which together yield a significant performance improvement as well as an appreciable reduction in average bus traffic.

A brief description of a representative scheme of the IBM z990 multiprocessor structure, with the key components of the configuration and the relevant figure, is given on the website: http://routledge.com/9780367255732.

Operating System Considerations

Due to rapid advancement in VLSI technology, multiprocessor architectures have by this time evolved along a diverse spectrum of dimensions, the dominant ones being the shared-memory UMA model, giving rise to the symmetric multiprocessor (SMP) system, and the NUMA model along with its numerous variations in form and pattern using distributed shared-memory architecture. The types and forms of operating system (OS) needed to efficiently drive these various kinds of multiprocessors with dissimilar architectural patterns should, accordingly, also differ. In fact, it is the OS that essentially projects a view of the underlying hardware to the user and can best extract the highest potential of that hardware. If the OS is matched perfectly with the underlying hardware architecture (made for each other), it can deliver performance going even beyond the highest level of expectation and extract the best out of the hardware capabilities provided. However, the details of the various types of operating systems to be employed to drive these different kinds of multiprocessor models lie outside the scope of this book; interested readers are referred to any standard textbook in this area, including the one by the same author (Operating Systems, Chakraborty, P., Jaico Publications, 2011).

The need for an appropriate operating system to efficiently drive each of the various kinds of multiprocessors having dissimilar architectural forms is described on the website: http://routledge.com/9780367255732.

 