We validated this work with several benchmarks. A synthetic ping-pong benchmark and the OSU and SHOMS microbenchmarks were used to evaluate the performance of basic OpenSHMEM routines. We then used the OpenSHMEM version of the HPCS SSCA 1 benchmark to evaluate the overall quality of the implementation in the context of a real-life computational kernel. The results of these benchmarks are detailed and analyzed in this section.
Short Message Latency
To measure short-message latency, we implemented a simple ping-pong benchmark. Each run used one PE per node on two nodes. The originating PE issues a Put of a given message size to the remote PE, which waits for the message to arrive. Once the message is received, the receiver writes a message back to the originating PE. The message size is varied between runs. The results of the experiment are shown in Fig. 4.
Fig. 4. Ping pong results. Lower is better.
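The timing structure of the ping-pong pattern described above can be sketched as follows. This is only an illustrative analogue, not the benchmark used in this work: it uses Python threads and queues as stand-ins for the two PEs and their Put operations, and the names `ping_pong`, `msg_size`, and `iters` are hypothetical. The absolute numbers it produces reflect thread hand-off cost and are not comparable to the values in Fig. 4.

```python
import queue
import threading
import time


def ping_pong(msg_size, iters):
    """Return (mean round-trip time in microseconds, number of echoes received).

    Assumption: two threads model the originating and remote PEs, and a
    queue hand-off models a Put followed by the receiver noticing the data.
    """
    to_remote = queue.Queue()  # messages travelling originator -> remote
    to_origin = queue.Queue()  # replies travelling remote -> originator

    def remote():
        for _ in range(iters):
            msg = to_remote.get()  # wait until the message has arrived
            to_origin.put(msg)     # write a message back to the originator

    worker = threading.Thread(target=remote, daemon=True)
    worker.start()

    payload = bytes(msg_size)
    echoed = 0
    start = time.perf_counter()
    for _ in range(iters):
        to_remote.put(payload)     # "ping"
        reply = to_origin.get()    # block until the "pong" comes back
        if len(reply) == msg_size:
            echoed += 1
    elapsed = time.perf_counter() - start
    worker.join()

    # One iteration is one full round trip, so divide by the iteration count.
    return elapsed / iters * 1e6, echoed
```

Calling `ping_pong(8, 1000)` returns the mean round-trip time for 8-byte messages over 1000 iterations. Sweeping `msg_size` over a range of values mirrors how a latency curve over message sizes would be collected.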
Cray SHMEM performs slightly better than the OpenSHMEM-UCX implementation: the round-trip latencies of OpenSHMEM-UCX and Cray SHMEM are 1.84 µs and 1.51 µs, respectively. We believe this gap is due to the differing completion semantics of the two implementations. The advantages of our approach can be seen in the message rate.