Crossbar Networks

Switched networks are versatile and provide dynamic interconnections between sources and destinations. The simplest circuit for connecting n CPUs with m memories is the crossbar switch, which has been used for decades in computers, and also in telephone networks, to connect a group of incoming lines to a set of outgoing lines in an arbitrary way. In this network, any module Xi can be dynamically connected to any other module Yk by switching an appropriate set of crosspoints on. In fact, there is a direct link between every pair of nodes in this network, and hence this type of fully connected network allows many transfers to take place simultaneously. If n sources need to communicate with m distinct destinations, all of these transfers can proceed independently and asynchronously, since every possible path (every permutation) is available and all the transfers can take place concurrently. Since any transfer can be made, and is never prevented at any point of time, the crossbar network is sometimes called a nonblocking network.

Figure 10.5 shows a simple single switch placed at each intersection of a horizontal and a vertical line, called a crosspoint. This switch can be electrically opened or closed, depending on whether the horizontal and vertical lines are to be connected or not. In an actual multiprocessor implementation, the paths through the crossbar network are much wider, which implies that multiple switches are needed at each crosspoint. Moreover, when a total of n modules are interconnected in such a network, the number of crosspoints becomes n2, which makes the total number of required switches very high as n gradually increases, resulting in a less cost-effective network that also requires extensive hardware to implement. This unfortunate property restricts the widespread use of crossbar networks in general; they are found to be profitable mostly in small or medium-sized systems. The advantage is their nonblocking nature, which requires no advance planning. Moreover, processor interfaces and memory port logic become simpler and cheaper, because the crosspoint switches used are equipped with arbitration logic and conflict-resolution mechanisms, along with other logic, and thus ultimately bear all the responsibilities required for hassle-free communication in this approach.


FIGURE 10.5

(a) An 8 x 8 Crossbar switch, (b) An open crosspoint, and (c) A closed crosspoint.

In an n x n crossbar network, n processors can send at most n memory requests independently and asynchronously. Two situations may arise. If all n requests from the n processors go to n distinct memory modules, they can all be serviced simultaneously in each memory cycle. This is possible when the memory modules are made n-way interleaved to allow overlapped access. If all, or some, of the requests are destined for the same memory module at the same time, then only one of those requests can be serviced at a time.
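The two situations above can be sketched with a small per-cycle arbitration loop. This is only an illustrative model (the function and data layout are my own, not from any real system): requests to distinct modules are granted in the same cycle, while conflicting requests to one module are deferred to later cycles.

```python
def arbitrate(requests):
    """Crossbar arbitration sketch (hypothetical model).

    Each request is a (processor, memory_module) pair.  Requests to
    distinct modules are granted in the same memory cycle; requests
    that conflict over one module are serialized, one per cycle.
    Returns a list of cycles, each a list of granted requests.
    """
    cycles = []
    pending = list(requests)
    while pending:
        granted, taken, deferred = [], set(), []
        for proc, mod in pending:
            if mod in taken:              # module already claimed this cycle
                deferred.append((proc, mod))
            else:                         # distinct module: grant now
                taken.add(mod)
                granted.append((proc, mod))
        cycles.append(granted)
        pending = deferred
    return cycles
```

For example, `arbitrate([(0, 1), (1, 2), (2, 1)])` grants processors 0 and 1 together in the first cycle and defers processor 2, which collides with processor 0 over module 1, to a second cycle.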

With the use of unary switches (which can be set on or off), crossbar networks are single-stage and nonblocking. A single-stage crossbar network is not expandable once it has been built. One such implementation with larger crossbar switches is found in Sun's E10000 system, in which 16 four-processor nodes are connected by a 16 x 16 crossbar switch. The crossbar switch mechanism has even been implemented on-chip in a recent release from Sun, the UltraSPARC T2 multicore multithreaded high-end processor (see Chapter 8).

However, a multilevel crossbar switch can also be used to connect more modules (a larger configuration), where a crossbar switch at level 1 connects to a crossbar switch at level 2, and so on. Such systems have been used in large computers like the Fujitsu VPP500 (a vector parallel processor), Hitachi SR8000 and NEC SX-5 systems. In these systems, a large crossbar network of a different type (224 x 224) is used for processor-processor (interprocessor) communication. Here, the nodes are the processors (or processing elements, PEs, with attached memory) and the CPs (control processors), which monitor the operation of the entire system as well as of the crossbar network. In this network, only one crosspoint switch can be set on in each row and in each column. In all Cray multiprocessors, crossbar networks are used between the processors and the memory banks. The banks use 64-, 128-, or 256-way interleaving (in the Cray Y-MP) and can be accessed via four ports.

Multiport Memory

One of the major drawbacks of a crossbar network is its high cost and cumbersome wiring when built into a multiprocessor. Many mainframe multiprocessors therefore use a multiport memory organisation to reduce the load, and thereby to lower the cost and complexity of the crosspoint switches being used. This approach relieves the crosspoint switches of the additional burden of handling all arbitration and conflict resolution, and entrusts it instead to the memory controller attached to the memory modules. The solution offered by a multiport memory organisation thus lies somewhere midway between a low-cost, low-performance bus-based system and a high-cost, high-bandwidth crossbar system.

A brief detail of multiport memory and its operations with a relevant figure is given on the website:

Multistage Networks

Both the bus-based and crossbar systems just explained are single-stage switching networks connecting a source and a destination. Interconnection networks can also be developed using multiple stages of switches to set up more paths between sources and destinations. Such networks are less costly, easy to implement, and at the same time provide a reasonably large number of additional paths between sources and destinations. Multistage interconnection networks (MIN) are thus employed to build both larger multiprocessor systems and SIMD computers. A generalized multistage network is illustrated in Figure 10.6. Various classes of MINs can, however, be constructed that differ in the switch modules being used, and in the manner of wiring the inter-stage connection (ISC) patterns being built up (Adams, G. B. III).

The simplest design is based on the humble 2x2 switching elements as shown in Figure 10.7a. The switches can be dynamically set to establish the desired connections


FIGURE 10.6

A generalized structure of a multistage interconnection network (MIN) with r x y switching elements and interstage connection patterns (ISC1, ISC2, ..., ISCn).


FIGURE 10.7

Switching element with different states and message format: (a) Control C, (b) Direct, (c) Crossover, (d) Upper broadcast, (e) Lower broadcast, and (f) Message format.

between the inputs and outputs to construct dynamic interconnection networks. Each switch S has a pair of input data buses X1 and X2, a pair of output data buses Z1 and Z2, and some control logic (not shown). All four buses are identical and can function as processor-processor or processor-memory links. Messages arriving on either input line can be switched to either output line. Various combinations of the switch states implement different permutations, broadcasts, or other connections from the inputs to the outputs.

When Z1 = X1 (Z1 is connected to X1) and Z2 = X2, it is called the direct or through state T of S. This is shown in Figure 10.7b. When Z1 = X2 and Z2 = X1, it is called the crossover state X of S, as shown in Figure 10.7c. The broadcast states are shown in Figure 10.7(d) and (e). While switch states may be of various types, messages for our purposes will contain up to four parts, as shown in Figure 10.7f. In the message format shown, the Module field tells which memory module to use. The Address field specifies an address within that module. The Opcode field gives the operation to be performed, such as READ or WRITE. The optional Value field contains an operand, such as a 32-bit or 64-bit word to be written. The switching element examines the Module field and uses it to determine the direction of transfer, either to Z1 or to Z2.
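The four-part message format of Figure 10.7f can be sketched as a simple record. The field names below are illustrative only (chosen to match the text, not any real hardware definition):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    """Sketch of the four-part message format of Figure 10.7f.

    Field names are illustrative.  A switching element would examine
    only the `module` field to choose between outputs Z1 and Z2.
    """
    module: int                 # which memory module to use (routing tag)
    address: int                # address within that module
    opcode: str                 # operation to perform, e.g. "READ" or "WRITE"
    value: Optional[int] = None # optional operand, e.g. a word to be written
```

A READ request needs no Value, e.g. `Message(module=0b110, address=42, opcode="READ")`, while a WRITE carries the word to store in `value`.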

By using this 2 x 2 switch as a building block, larger multistage networks can be arranged in many different ways (Wu, C. L., et al.) and used in massively parallel computing systems. The ISC (inter-stage connection) patterns that are often used are also of various types, including the multistage crossbar, perfect shuffle, multiway shuffle, butterfly, cube connection, etc. Some of these ISC patterns are illustrated here in brief.

Omega Network (Perfect Shuffle)

The economy-class Omega network can be constructed from any of the four possible connections of the 2 x 2 switch shown in Figure 10.7 by setting the control signals of the switching elements in various ways. Processor-processor or processor-memory connections can be established with this network, depending on the number of stages, the fixed connections linking the stages, and the settings of the switching elements. Here, as shown in Figure 10.8, we have connected eight CPUs to eight memories using 12 switching elements arranged in three stages, providing dynamic connections among the eight processors and eight memories. More generally, to connect n CPUs with n


FIGURE 10.8

An 8 x 8 Omega network built in three stages with 2 x 2 switches.

memories, the number of stages needed is log2n, with n/2 switching elements per stage; each element in each stage is individually controlled, and a total of (n/2) x log2n simpler 2 x 2 switching elements are thus required here. This is, however, far better than the n2 crosspoints of a crossbar network, especially for large values of n.
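The scale of the savings is easy to see with a quick back-of-the-envelope comparison (a Python sketch; the function names are my own):

```python
import math

def omega_switch_count(n):
    """2x2 switching elements in an n x n Omega network:
    log2(n) stages of n/2 switches each, i.e. (n/2) * log2(n)."""
    return (n // 2) * int(math.log2(n))

def crossbar_crosspoints(n):
    """An n x n crossbar needs n**2 crosspoint switches."""
    return n * n

for n in (8, 64, 1024):
    print(n, omega_switch_count(n), crossbar_crosspoints(n))
```

For n = 8 this gives the 12 elements of Figure 10.8 against 64 crosspoints; by n = 1024 the gap has grown to 5,120 switching elements against over a million crosspoints.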

The wiring pattern of the Omega network is often called a perfect shuffle, since the mixing of signals at each stage resembles a deck of cards being cut in half and then riffled together card-for-card. Control logic associated with the MIN sets the switch states dynamically to service the interconnection requests issued by processors or memories. This can be viewed as self-routing. A particular MIN state, once set, is retained long enough to allow at least one package to be transferred through the network. The state then changes to match the source-destination requirements of the next set of packages, and so on. It can be assumed that each processor uses a buffer to temporarily store or queue up its outgoing packages until the MIN is ready to transfer them. The processor, on the other hand, accepts incoming packages as soon as they arrive.
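The perfect-shuffle wiring can be stated precisely: input line i connects to the output line whose binary index is i rotated left by one bit, exactly the interleaving produced by a perfect card shuffle. A minimal sketch:

```python
def perfect_shuffle(i, n_bits):
    """Perfect-shuffle wiring: the output line for input line i is the
    n_bits-bit binary index of i rotated left by one position, which
    interleaves the two halves of the inputs like a riffled deck."""
    msb = (i >> (n_bits - 1)) & 1                 # bit that wraps around
    return ((i << 1) & ((1 << n_bits) - 1)) | msb
```

For an 8-line (3-bit) shuffle, line 1 (001) goes to line 2 (010), and line 4 (100), the first card of the second half of the deck, goes to line 1 (001), landing between the first two cards of the first half.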

To see how the Omega network works, suppose that CPU 010 wants to read a word from memory module 110. The CPU sends a READ message containing 110 in the Module field to switch 1C. Data routing is controlled by inspecting the destination code (here 110) in binary. When the ith high-order bit of the destination code is a 0, a 2 x 2 switch at stage i connects the input to the upper output; otherwise, the input is directed to the lower output. Switch 1C, to which CPU 010 is connected, takes the first (i.e. leftmost) bit of 110 and uses it for routing. Since that bit is a 1, the message is routed via the lower output to 2B. The route of this communication is indicated by an arrow in Figure 10.8.

All switches in the second stage, including 2B, use the second leftmost bit of the destination for routing. This second leftmost bit of 110 is also a 1, so the message is now forwarded via the lower output of 2B to 3D. Now the third bit is tested and found to be a 0. Consequently, the message goes out on the upper output of 3D and reaches the targeted memory module 110.
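This destination-tag rule, one bit per stage, MSB first, can be sketched in a few lines (an illustrative model, not the actual switch logic of any machine):

```python
def omega_route(dest, n_bits=3):
    """Destination-tag routing through an n_bits-stage Omega network.

    At stage i the switch inspects the ith high-order bit of the
    destination code: 0 routes the message to the upper output,
    1 to the lower output.  Returns the list of port choices.
    """
    choices = []
    for i in range(n_bits):
        bit = (dest >> (n_bits - 1 - i)) & 1   # ith bit, MSB first
        choices.append("lower" if bit else "upper")
    return choices
```

For the example in the text, destination 110 yields the path lower, lower, upper, regardless of which CPU issued the request; self-routing needs only the destination code carried in the message.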

If two CPUs simultaneously want to access the same memory module, a conflict occurs, and one of them has to wait until the other finishes, causing communication delays and consequently lowering the effective bandwidth. Unlike the crossbar switch, the Omega network is thus a blocking network. Conflicts can arise for a number of reasons: over the use of a wire, over the use of a switch, or even between requests to memory and replies from memory.

Interleaving the memories, which distributes memory references uniformly among the memory modules, may reduce the occurrence of such conflicts, but it is not foolproof. Consider the problem of barrier synchronization, which occurs in many parallel applications with multiple processes, where every process tries to access a certain memory module at the same time. Mutual exclusion of these simultaneous accesses can be enforced, for example, using a counting semaphore variable, but then all the processes sit in a tight loop, each continuously checking the counter, and none can proceed until the counter becomes zero. This memory module then becomes a hot spot (bottleneck), which degrades network performance significantly.

The Omega network can also be used to broadcast data from one source to many destinations as illustrated in Figure 10.9 using the upper broadcast or lower broadcast switch settings. The figure shows that the message at input 010 (processor 010) is being broadcast to all eight outputs (memories) through a binary tree connection (as shown with arrows in the figure).

The number of switches used in Omega networks can be reduced using a trick employed in some multiprocessors. Instead of having log2n stages, there is only


FIGURE 10.9

Broadcast capability of an Omega network (8 x 8) built in three stages with 2 x 2 switches.

one stage in the network. The output obtained from this stage is fed back into the input, so that each message traverses log2n passes (not stages) over the network. Such a design is called a recirculating Omega network. Its major disadvantage is that pipelining cannot be implemented (there is only a single stage). Hence, while up to (n/2) log2n messages can be switched simultaneously in a multistage Omega network, in a recirculating network the number is certainly less, with a maximum of only n/2.

The Omega network thus has two major drawbacks, namely the blocking problem and the hot-spot issue, that need to be addressed. The blocking problem (already discussed) in Omega networks is resolved by the Benes network. The hot-spot problem, however, is dealt with by the fetch-and-add approach.

Benes Network

The blocking problem in Omega networks can be solved by the Benes network, which adds more stages, using a five-stage network (in place of three stages) with additional hardware that makes it nonblocking. In an Omega network there is exactly one path from any CPU to any memory, whereas in a Benes network there is more than one path and many other alternatives.

A brief detail of Benes network and its operations with relevant figure is given on the website:

The Hot-Spot Problem: Fetch-And-Add Approach

If multiple requests are issued simultaneously from many processors to access a certain memory module, that module may become a hot spot. Use of a lock or semaphore can make these simultaneous accesses mutually exclusive, but at the cost of busy waiting by the processors, which continuously check the semaphore to gain access to the module. Moreover, the semaphore may then itself become a hot spot, since it is shared by many processors. As an alternative, a combining mechanism has been proposed that can be added to the Omega network to overcome this problem. Multiple requests issued simultaneously from many processors and heading for the same destination are combined at the switch points where the conflicts occur, and appropriate actions are then taken to resolve the problem.
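The primitive that such combining networks support is fetch-and-add: atomically read a shared variable and add an increment to it, returning the old value. The sketch below only emulates the semantics with a lock (a real combining network merges two requests F&A(x, e1) and F&A(x, e2) inside a switch into a single F&A(x, e1 + e2), so the hot module sees one request); the class and its names are illustrative, not any machine's actual interface.

```python
import threading

class FetchAndAdd:
    """Lock-based emulation of the fetch-and-add primitive.

    fetch_and_add(e) atomically returns the old value of the counter
    and adds e to it.  In a combining Omega network this atomicity is
    provided by the switches themselves, without a shared lock.
    """
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, increment):
        with self._lock:
            old = self._value        # value returned to the requester
            self._value += increment # update applied atomically
            return old
```

With this primitive, n processors arriving at a barrier each perform one fetch-and-add on a counter and receive distinct old values 0 through n-1, instead of spinning on a semaphore and turning the counter's module into a hot spot.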

A brief detail of the Fetch-And-Add algorithm and its operation is given on the website:

Butterfly Network

Another useful class of MINs is based on the butterfly connection of crossbar switches as depicted in Figure 10.10. Here, an 8 x 8, three-stage butterfly network is shown using 2x2 switching elements; note that the butterfly connection is placed after, rather than before, the n/2 switching elements.

Butterfly networks of different sizes can be realized with different crossbar switching elements, such as 4 x 4, 8 x 8, etc. If an 8 x 8 crossbar switch is used, then to connect 64 inputs a network consisting of only two stages (2 = log8 64) is required. This is illustrated in Figure 10.11. To connect more inputs, either more stages or larger switching elements are needed to build the network.
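The general relation is that k x k switching elements span k^s ports in s stages, so the stage count is the smallest s with k^s >= n. A small sketch (avoiding floating-point logarithms, whose rounding can be off by one for exact powers):

```python
def stages_needed(n_ports, switch_size):
    """Smallest number of stages s of (switch_size x switch_size)
    crossbar elements such that switch_size**s >= n_ports."""
    stages, span = 0, 1
    while span < n_ports:
        span *= switch_size   # each extra stage multiplies the reach by k
        stages += 1
    return stages
```

This reproduces the figures in the text: two stages of 8 x 8 switches for 64 ports (Figure 10.11), and three stages of 2 x 2 switches for 8 ports (Figure 10.10).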

Drawbacks of Multistage Networks: The two major drawbacks of multistage networks are the cost of the switching elements used (whether 2 x 2, 4 x 4, or 8 x 8), together with the additional expenditure due to increased wiring, and the penalties paid to implement the atomic operations required when different sources attempt concurrent accesses to the same destination. These drawbacks can, however, be overcome with the use of cheaper and faster switching technology, if available; in that situation, this network with all its merits can be employed to build large-scale UMA (uniform memory access) multiprocessor systems, since the delay of the path connecting any two modules is always the same. Interest in these networks gradually increased, reaching a peak in the 1980s, and has diminished significantly in recent years. Other schemes, which we will discuss next, gradually came into favour for their many distinct advantages, and have hence become more attractive.

FIGURE 10.10

An 8 x 8 Butterfly switch network with 2 x 2 Crossbar switches.

FIGURE 10.11

A 64 x 64 Butterfly switch network with 8 x 8 Crossbar switches using two-stage and eight-way shuffle interstage connections.
