Atomic Memory Operations

The Epiphany-III ISA does not include atomic instructions, but the TESTSET instruction used for remote locks can be used to build other atomic operations in software. With the current code design, extending the library to additional atomic operations is trivial, requiring a single line of code, should the OpenSHMEM specification define new ones in the future. At the core level, memory accesses for both fetch and set operations complete in a single clock cycle and are therefore implicitly atomic. The fetch operation must still traverse the network to the remote core and return the result. Each data type specialization uses a different lock on the remote core, as per the specification. The performance results for the 32-bit integer atomic routines appear in Fig. 5.
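As an illustration of this approach, the following is a minimal sketch of how an integer fetch-and-add could be layered on a per-type remote lock. The epiphany_testset_lock and epiphany_testset_unlock helpers are hypothetical stand-ins for the library's internal TESTSET wrappers, and sketch_int_fadd is an illustrative name; only shmem_int_g, shmem_int_p, and shmem_quiet are standard OpenSHMEM calls.

/* Sketch: integer fetch-and-add built on a remote per-type lock.
 * The lock helpers below are assumed wrappers around TESTSET, not
 * the library's actual internal interface. */
#include <shmem.h>

extern void epiphany_testset_lock(volatile int *lock, int pe);   /* assumed: spins on TESTSET */
extern void epiphany_testset_unlock(volatile int *lock, int pe); /* assumed: remote write of 0 */

static volatile int atomic_int_lock;   /* symmetric: one lock per data type */

int sketch_int_fadd(int *target, int value, int pe)
{
    epiphany_testset_lock(&atomic_int_lock, pe);  /* acquire the lock on the remote core */
    int old = shmem_int_g(target, pe);            /* fetch traverses the mesh and returns */
    shmem_int_p(target, old + value, pe);         /* write back the updated value */
    shmem_quiet();                                /* complete the put before releasing */
    epiphany_testset_unlock(&atomic_int_lock, pe);
    return old;
}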

Collective Routines

Multi-core barriers are critical to the performance of many parallel applications. The Epiphany-III includes hardware support for a fully collective barrier through the WAND instruction and its corresponding ISR. This hardware support is included as an experimental feature within the OpenSHMEM library and must be enabled by specifying SHMEM_USE_WAND_BARRIER at compile time.


Fig. 5. Performance of OpenSHMEM atomic operations for 32-bit integers and a variable number of processing elements. Atomic operations are performed in a tight loop on the next neighboring processing element.


Fig. 6. Performance of shmem_barrier for a variable number of processing elements (left) and the performance of shmem_broadcast64 for variable message sizes (right).

After evaluating several barrier implementations, the dissemination barrier was found to be the highest-performing software method. It is not clear whether this algorithm will continue to achieve the highest performance on chip designs with larger core counts; alternative tree algorithms may be needed. The eLib interface in the eSDK uses a counter-based collective barrier and requires an amount of memory that increases linearly with the number of cores. The dissemination barrier requires 8·log2(N) bytes of memory, where N is the number of processing elements within the barrier. The use of this synchronization array avoids the need for lock-based signaling at each stage of the barrier. The collective eLib barrier completes in 2.0 μs while the WAND barrier completes in 0.1 μs. The performance of group barriers for a subset of the total processing elements is shown in Fig. 6. The latency of the dissemination barrier increases logarithmically with the number of cores, so that barriers across more than eight cores take approximately 0.23 μs.
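A minimal sketch of the dissemination pattern is shown below, using standard OpenSHMEM puts and waits rather than the library's internal signaling; the sync_flags array, the epoch counter, and the routine name are illustrative, and the flag count assumes the 16 processing elements of the Epiphany-III.

/* Sketch: dissemination barrier using one 8-byte flag per round,
 * i.e., 8*log2(N) bytes of synchronization memory. Flag and counter
 * names are illustrative, not the library's internals. */
#include <shmem.h>

#define MAX_ROUNDS 4                     /* log2(16) rounds for the Epiphany-III */
static long sync_flags[MAX_ROUNDS];      /* symmetric: one flag per round */

static void dissemination_barrier(void)
{
    static long epoch = 1;               /* monotonically increasing barrier count */
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    for (int r = 0, dist = 1; dist < npes; r++, dist <<= 1) {
        int peer = (me + dist) % npes;
        shmem_long_p(&sync_flags[r], epoch, peer);                  /* signal the peer */
        shmem_long_wait_until(&sync_flags[r], SHMEM_CMP_GE, epoch); /* wait on own flag */
    }
    epoch++;
}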

Broadcasts are important for Epiphany application development because they limit the replication of off-chip accesses to common memory. It is faster to retrieve off-chip data once and disseminate it to the other processing elements algorithmically than for each processing element to fetch the same off-chip data. The data are distributed with a logical network tree, moving the data the farthest distance first in order to prevent subsequent stages from increasing on-chip network congestion. The broadcast routines use the same high-performance memory copying subroutine as the contiguous data transfers. The effective core bandwidth approaches the theoretical peak for this algorithm and is approximately 2.4/log2(N) GB/s. Figure 6 shows collective broadcast performance for variable message sizes.
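The tree pattern can be sketched as below for a power-of-two number of processing elements rooted at PE 0; the per-stage flags and routine name are illustrative, and the real routine additionally handles active sets and pSync bookkeeping.

/* Sketch: binomial-tree broadcast that pushes data the farthest
 * distance first. Assumes a power-of-two PE count, root PE 0, and
 * a symmetric destination buffer. */
#include <shmem.h>
#include <string.h>

static long bcast_flag[4];   /* symmetric per-stage arrival flags (log2(16) = 4) */

void sketch_broadcast(void *dest, const void *src, size_t nbytes)
{
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    if (me == 0)
        memcpy(dest, src, nbytes);       /* root seeds its own buffer */

    for (int stage = 0, stride = npes / 2; stride >= 1; stride /= 2, stage++) {
        if (me % (2 * stride) == 0) {
            /* this PE holds the data: push it the full stride away */
            shmem_putmem(dest, dest, nbytes, me + stride);
            shmem_fence();                                      /* order data before flag */
            shmem_long_p(&bcast_flag[stage], 1, me + stride);   /* signal arrival */
        } else if (me % stride == 0) {
            shmem_long_wait_until(&bcast_flag[stage], SHMEM_CMP_EQ, 1);
            bcast_flag[stage] = 0;                              /* reset for reuse */
        }
    }
}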

The shmem_collect and shmem_fcollect routines use ring and recursive doubling algorithms, respectively, for concatenating blocks of data from multiple processing elements. Each uses the optimized contiguous memory copying routine. There is likely room for improvement in these routines; the measured performance appears in Fig. 7.
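A sketch of the recursive doubling exchange used for the fixed-size case is shown below; the per-step flags, buffer layout, and routine name are illustrative, and a power-of-two PE count is assumed.

/* Sketch: recursive-doubling fcollect of fixed-size 64-bit blocks.
 * Assumes a power-of-two PE count and a symmetric destination array
 * of npes*nelems elements; flags are illustrative. */
#include <shmem.h>
#include <string.h>

static long fcollect_flag[4];            /* one flag per step, log2(16) = 4 */

void sketch_fcollect64(long *dest, const long *src, size_t nelems)
{
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    memcpy(dest + me * nelems, src, nelems * sizeof(long));  /* place own block */

    for (int step = 0, dist = 1; dist < npes; step++, dist <<= 1) {
        int partner = me ^ dist;
        size_t start = (size_t)(me & ~(dist - 1)) * nelems;   /* range held so far */
        size_t count = (size_t)dist * nelems;

        shmem_put64(dest + start, dest + start, count, partner); /* send held blocks */
        shmem_fence();                                           /* data before flag */
        shmem_long_p(&fcollect_flag[step], 1, partner);          /* signal delivery */
        shmem_long_wait_until(&fcollect_flag[step], SHMEM_CMP_EQ, 1);
        fcollect_flag[step] = 0;                                 /* reset for reuse */
    }
}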


Fig. 7. Performance of linear scaling shmem_collect64 and recursive doubling shmem_fcollect64 for variable message sizes on 16 processing elements

The shmem_TYPE_OP_to_all reduction routines are important for many multi-core applications. The routines use different algorithms depending on the number of processing elements: a ring algorithm for non-power-of-two counts and a dissemination algorithm for powers of two. The symmetric work array is used for temporary storage, and the symmetric synchronization array is used for multi-core locks and signaling. The performance of shmem_int_sum_to_all appears in Fig. 8. Other routines vary marginally in performance due to the data type and arithmetic operation used. Reductions that fit within the symmetric work array have improved latency, as seen in the figure.
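For the power-of-two case, the dissemination (recursive doubling) pattern can be sketched as follows; the fixed-size buffers, per-step flags, and routine name are illustrative stand-ins for the pWrk and pSync arrays the real routine uses, and reuse across calls would need additional guarding.

/* Sketch: recursive-doubling sum-to-all over ints for a power-of-two
 * PE count. Buffers and flags stand in for pWrk/pSync; assumes
 * nreduce <= NREDUCE and no overlapping calls. */
#include <shmem.h>

#define NREDUCE 16
static int  partial[NREDUCE];            /* symmetric working buffer */
static int  incoming[4][NREDUCE];        /* one receive buffer per step */
static long reduce_flag[4];              /* per-step signals, log2(16) = 4 */

void sketch_int_sum_to_all(int *dest, const int *src, int nreduce)
{
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    for (int i = 0; i < nreduce; i++)
        partial[i] = src[i];                                      /* local contribution */

    for (int step = 0, dist = 1; dist < npes; step++, dist <<= 1) {
        int partner = me ^ dist;
        shmem_int_put(incoming[step], partial, nreduce, partner); /* send partials */
        shmem_fence();                                            /* data before flag */
        shmem_long_p(&reduce_flag[step], 1, partner);
        shmem_long_wait_until(&reduce_flag[step], SHMEM_CMP_EQ, 1);
        reduce_flag[step] = 0;
        for (int i = 0; i < nreduce; i++)
            partial[i] += incoming[step][i];                      /* combine */
    }
    for (int i = 0; i < nreduce; i++)
        dest[i] = partial[i];
}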


Fig. 8. Reduction performance for shmem_int_sum_to_all for all 16 processing elements. The latency and the number of collective reductions per second are shown. The effect of the minimum symmetric work array size for reductions, defined as SHMEM_REDUCE_MIN_WRKDATA_SIZE per the OpenSHMEM specification, is apparent for small reductions


Fig. 9. Performance of the new (to version 1.3) contiguous all-to-all data exchange operation, shmem_alltoall, for 16 processing elements

The performance of the contiguous all-to-all data exchange, shmem_alltoall, appears in Fig. 9. This routine has a relatively high overhead latency compared to other collectives.

Distributed Locking Routines

The distributed locking routines, shmem_set_lock and shmem_test_lock, are easily supported by the atomic TESTSET instruction. The lock address is defined by the implementation to reside on the first processing element. These locking mechanisms are also the basis for the atomic operations detailed in Sect. 3.5, but applied across multiple processing elements. The shmem_clear_lock routine is a simple remote write that frees the lock. Although this scheme works well for the 16 processing elements of the Epiphany-III, the single lock location will likely become a performance bottleneck when scaling to much larger core counts, and application developers should avoid using these global locks.
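A minimal sketch of the locking scheme is given below. The epiphany_testset wrapper and the sketch_* names are assumptions used for illustration; the real library uses the long-typed OpenSHMEM lock interface and maps the lock word onto the first processing element as described above.

/* Sketch: lock routines layered on a TESTSET-style primitive.
 * epiphany_testset is an assumed wrapper: it atomically writes 'val'
 * to '*lock' on PE 'pe' if the current value is zero and returns the
 * value observed before the write. */
#include <shmem.h>

extern int epiphany_testset(volatile int *lock, int val, int pe);

#define LOCK_HOME_PE 0   /* lock storage resides on the first processing element */

void sketch_set_lock(volatile int *lock)
{
    /* spin until the remote TESTSET observes zero and claims the lock */
    while (epiphany_testset(lock, shmem_my_pe() + 1, LOCK_HOME_PE) != 0)
        ;
}

int sketch_test_lock(volatile int *lock)
{
    /* single attempt: returns 0 on success, nonzero if already held */
    return epiphany_testset(lock, shmem_my_pe() + 1, LOCK_HOME_PE);
}

void sketch_clear_lock(volatile int *lock)
{
    shmem_quiet();                              /* complete outstanding puts first */
    shmem_int_p((int *)lock, 0, LOCK_HOME_PE);  /* remote write frees the lock */
}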

 