Porting MPI Communication Structures in SHOC

Supporting the communication requirements of the SHOC benchmarks in OpenSHMEM required two main tasks: implementing MPI-style synchronization collectives and implementing process teams. Both were needed by the four truly parallel (TP) benchmarks, which use inter-device communication and synchronization. Table 1 summarizes the requirements that were implemented for each of the TP benchmarks.

Table 1. SHOC benchmark requirements

Team Split, Team Barrier, Team Broadcast, Team AllReduce Sum
Parallel Prefix Scan
AllReduce Sum
Reduce Sum, AllReduce Sum
Parallel Results DB (a)

(a) The parallel results database is used in all EP and TP benchmarks.

Process Teams for Gradual Reduction of Devices

The QTC benchmark iteratively clusters elements into groups. At each iteration, the number of elements left to cluster shrinks, so eventually there may be too few elements to keep all of the GPU devices in the system busy. This pattern is representative of many iterative clustering algorithms and could be used in various data mining applications.

At a very high level, QTC executes the following:

procedure QTC Main Loop
    Calculate total number of ranks needed for current work
    if my rank is needed to do work then
        color ← 1
    else
        color ← 0
    end if
    mygroup ← result of split mygroup by color
    if color == 0 then
        Exit Main Loop
    end if
    Move data to CUDA device
    Find local results using CUDA device
    Use mygroup communicator to find global results using collectives
    Use global results to create work for next iteration
    goto top of main loop
end procedure
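The color decision at the start of the main loop can be sketched in plain C. This is a minimal illustration and not the SHOC code: the helper names ranks_needed and split_color, the items_per_rank parameter, and the choice to keep the lowest-numbered ranks active are all assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical helper: how many ranks the current work needs, assuming
 * each rank can usefully handle up to items_per_rank elements. The result
 * is capped at the number of ranks available in the current group. */
int ranks_needed(long work_items, long items_per_rank, int npes)
{
    assert(work_items >= 0 && items_per_rank > 0 && npes > 0);
    long needed = (work_items + items_per_rank - 1) / items_per_rank; /* ceiling */
    return needed < npes ? (int)needed : npes;
}

/* Hypothetical helper: color 1 means "stay in the working group",
 * color 0 means "leave the main loop". Keeping the lowest-numbered
 * ranks is an assumption of this sketch. */
int split_color(int my_rank, int needed)
{
    return my_rank < needed ? 1 : 0;
}
```

For example, with 10 remaining elements, 4 elements per rank, and 8 ranks, only ranks 0 through 2 would pick color 1; the other five would pick color 0 and exit after the split.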

To support this pattern, we ported the code in two stages. In the first stage, we used the Cray Message Passing Toolkit implementation of SHMEM, which provides several team-based operations. We used the following:

void shmem_team_split(shmem_team_t parent_team, int color, int key, shmem_team_t *newteam)
int shmem_team_translate_pe(shmem_team_t team1, int team1_pe, shmem_team_t team2)
void shmem_team_barrier(shmem_team_t myteam, long *pSync)
void shmem_team_free(shmem_team_t *newteam)
int shmem_team_npes(shmem_team_t newteam)
int shmem_team_mype(shmem_team_t newteam)

The first three functions are collectives that must be called by all team members. The remaining three can be called by any PE in the team. These functions provided two of the four requirements for QTC listed in Table 1. We implemented the team-based broadcast and reduction sum-to-all using these functions along with the shmem_get, shmem_put, and shmem_wait functions.

In the second stage of porting, we implemented a shmem_team_t type and the listed Cray SHMEM team function prototypes on top of the OpenSHMEM API. The only difference between the prototypes of our OpenSHMEM team functions and those of the Cray SHMEM functions was that our team split operation had the following prototype:

void shmem_team_split(shmem_team_t parent_team, int color, int key, shmem_team_t *newteam, long *pSyncBar)

We had to add the symmetric synchronization array for the barrier because our team split operation was built on top of a team-based gather collective that required a barrier.
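To illustrate how a split can be layered on a gather, the following sketch shows the purely local step that would run on each PE once the colors and keys of all parent-team members have been gathered. It is an illustration under stated assumptions, not the authors' implementation: the function name split_from_gather, the split_entry layout, and the ordering rule (by key, then by parent rank, as in MPI_Comm_split) are hypothetical, and the gather and barrier themselves are omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record for one member that shares the caller's color. */
typedef struct {
    int key;        /* ordering hint supplied to the split */
    int parent_pe;  /* rank within the parent team */
} split_entry;

/* Order by key, breaking ties by parent rank (assumed convention). */
static int cmp_entry(const void *a, const void *b)
{
    const split_entry *x = a;
    const split_entry *y = b;
    if (x->key != y->key)
        return (x->key > y->key) - (x->key < y->key);
    return (x->parent_pe > y->parent_pe) - (x->parent_pe < y->parent_pe);
}

/* colors[i] and keys[i] are the values gathered from parent rank i.
 * Fills members[] (capacity n) with the parent ranks of the caller's new
 * team in new-team order, stores the caller's new rank in *my_new_pe,
 * and returns the new team size. */
int split_from_gather(const int *colors, const int *keys, int n,
                      int my_parent_pe, int *members, int *my_new_pe)
{
    assert(n > 0 && my_parent_pe >= 0 && my_parent_pe < n);
    split_entry *tmp = malloc((size_t)n * sizeof *tmp);
    assert(tmp != NULL);

    int count = 0;
    for (int i = 0; i < n; i++) {
        if (colors[i] == colors[my_parent_pe]) {
            tmp[count].key = keys[i];
            tmp[count].parent_pe = i;
            count++;
        }
    }
    qsort(tmp, (size_t)count, sizeof *tmp, cmp_entry);

    *my_new_pe = -1;
    for (int i = 0; i < count; i++) {
        members[i] = tmp[i].parent_pe;
        if (tmp[i].parent_pe == my_parent_pe)
            *my_new_pe = i;
    }
    free(tmp);
    return count;
}
```

Because every PE would compute its membership from the same gathered arrays, all members of a new team arrive at the same ordering without further communication. The gather must have completed on every PE before this step runs, which is consistent with the barrier requirement described above.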
