OpenSHMEM as an Alternative to MPI
OpenSHMEM presents an alternative to message passing for inter-device communication in accelerated systems. Programs can combine OpenSHMEM with CUDA or OpenCL in a hybrid model that uses one-sided memory access to communicate between devices. Future systems could extend symmetric memory regions and one-sided access into accelerator kernels.
In this work, we pose a question: If a programmer uses SHMEM instead of MPI to write a hybrid program, what features do they require and how does the program perform? To answer this question, we ported the Scalable HeterOgeneous Computing (SHOC) benchmark suite from MPI + CUDA to OpenSHMEM + CUDA.
The SHOC benchmark suite represents a range of applications that have been shown to benefit from hardware acceleration. Porting these benchmarks shows how SHMEM can be used for periodic synchronization and communication between accelerator kernels. SHOC requires an atypical use of SHMEM: we do not attempt to engineer communication/computation overlap with one-sided accesses inside a computational core. We therefore provide implementations of several MPI collectives using only SHMEM. These codes also rely on MPI groups, so we provide an implementation of SHMEM teams that offers the same functionality.
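To make the idea of a group-restricted collective built from one-sided writes concrete, the following is a minimal single-process sketch in C. It is not the implementation described in this paper and it does not call the OpenSHMEM library: the symmetric memory is simulated with a static array, and every name in it (`team_t`, `sim_put`, `team_pe`, `team_allreduce_sum`, `NUM_PES`, `MAX_TEAM`) is hypothetical. The team is assumed to be described by a start PE, a stride, and a size.

```c
#include <assert.h>

#define NUM_PES  8   /* number of simulated processing elements (assumption) */
#define MAX_TEAM 8   /* largest team the simulated buffer can hold (assumption) */

/* Hypothetical team descriptor: the members are start, start+stride, ... */
typedef struct {
    int start;
    int stride;
    int size;
} team_t;

/* Simulated symmetric buffer: every PE owns one slot per possible team rank. */
static long sym_slots[NUM_PES][MAX_TEAM];

/* Simulated one-sided put: the source writes directly into the target PE's
 * copy of the symmetric buffer, with no matching receive on the target. */
static void sim_put(int target_pe, int slot, long value)
{
    sym_slots[target_pe][slot] = value;
}

/* Translate a rank within the team to a global PE number. */
static int team_pe(team_t t, int rank)
{
    return t.start + rank * t.stride;
}

/* Sum all-reduce restricted to the team. contrib[r] is the value held by
 * team rank r; on return, result[r] holds the team-wide sum for rank r. */
static void team_allreduce_sum(team_t t, const long *contrib, long *result)
{
    assert(t.size > 0 && t.size <= MAX_TEAM);
    assert(team_pe(t, t.size - 1) < NUM_PES);

    /* Phase 1: each member puts its contribution into slot [its rank]
     * on every member of the team. PEs outside the team are untouched. */
    for (int src = 0; src < t.size; ++src)
        for (int dst = 0; dst < t.size; ++dst)
            sim_put(team_pe(t, dst), src, contrib[src]);

    /* In a real distributed program a team-wide synchronization would be
     * required here so that all puts are complete before they are read. */

    /* Phase 2: each member sums the slots in its own local buffer. */
    for (int r = 0; r < t.size; ++r) {
        long sum = 0;
        for (int s = 0; s < t.size; ++s)
            sum += sym_slots[team_pe(t, r)][s];
        result[r] = sum;
    }
}
```

For example, with the team `{0, 2, 4}` (PEs 0, 2, 4, and 6) and contributions 1, 2, 3, and 4, every member ends up with the sum 10, while the buffers of the odd-numbered PEs are never written. The all-to-all put pattern is chosen here only for brevity; a tree or recursive-doubling pattern would send fewer messages.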
The results of porting show that SHMEM is sufficient to replace MPI communication for these hybrid codes. Using SHMEM required implementing replacements for group-based MPI collectives. We tested the implementation on Titan, a Cray XK7 system, to demonstrate that using SHMEM instead of MPI does not cause any significant performance reduction.
This paper is organized as follows. In Sect. 2 we describe the contents of the SHOC benchmark suite and the ways in which MPI and CUDA are mixed in that code. We then briefly describe other work using OpenSHMEM with accelerated parallel systems. Section 3 describes the implementation specifics of structures ported from MPI to OpenSHMEM in SHOC. Finally, Sect. 4 presents the performance results of the ported code on the Cray XK7 Titan system.