We present three possible methods for moving management of synchronisation buffers to the implementation. These all rely in some way upon maintaining multiple buffers and using a set of locks to determine which buffers are available for reuse.
This method involves maintaining n buffers for each team of PEs. Internally, these buffers can be managed in the same way as pSync arrays traditionally have been, except that each has an associated lock managed by the local PE. When a collective operation on a team is started, the PE finds the first free buffer and uses it for the operation. Once the operation is complete, the PE takes the next free buffer and uses it for a barrier synchronisation, which ensures that no PE leaves the collective operation before all PEs have completed it. Thus, once the barrier finishes, no prior operation can still be using its buffer, and all of those buffers may be unlocked.
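The acquire-then-release cycle described above can be sketched locally as follows. This is a minimal illustration rather than an implementation: the names `team_pool`, `pool_acquire`, `pool_release_prior` and `run_collective` and the buffer length `PSYNC_WORDS` are hypothetical, the collective and the barrier are reduced to comments, and the sketch assumes that the buffer used by the just-finished barrier stays locked until a later barrier completes.

```c
#include <assert.h>
#include <stdbool.h>

#define N_BUFFERS 3      /* n = 3: assumed pool size for blocking, single-threaded use */
#define PSYNC_WORDS 8    /* hypothetical buffer length, chosen only for illustration */

typedef struct {
    long sync[N_BUFFERS][PSYNC_WORDS];  /* pSync-style work arrays */
    bool locked[N_BUFFERS];             /* one local lock per buffer */
} team_pool;

/* Lock and return the first free buffer, or -1 if every buffer is locked. */
static int pool_acquire(team_pool *p)
{
    for (int i = 0; i < N_BUFFERS; i++) {
        if (!p->locked[i]) {
            p->locked[i] = true;
            return i;
        }
    }
    return -1;
}

/* Unlock every buffer except `keep`, the buffer of the barrier that just
 * finished (assumption: other PEs may still be reading that one). */
static void pool_release_prior(team_pool *p, int keep)
{
    assert(keep >= 0 && keep < N_BUFFERS);
    for (int i = 0; i < N_BUFFERS; i++) {
        if (i != keep)
            p->locked[i] = false;
    }
}

/* One collective: operation buffer, then barrier buffer, then release.
 * Returns 0 on success, -1 if the pool was exhausted. */
static int run_collective(team_pool *p, int *op_buf, int *barrier_buf)
{
    *op_buf = pool_acquire(p);
    if (*op_buf < 0)
        return -1;
    /* ... the collective would run here using p->sync[*op_buf] ... */

    *barrier_buf = pool_acquire(p);
    if (*barrier_buf < 0)
        return -1;
    /* ... the trailing barrier would run here using p->sync[*barrier_buf] ... */

    pool_release_prior(p, *barrier_buf);
    return 0;
}
```

Under these assumptions, repeated calls reuse buffer 0 for the operation and alternate between buffers 1 and 2 for the barrier, so the pool is never exhausted.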
Without support for threads or non-blocking operations, n = 3 buffers per team is sufficient for this approach: two for each pair of implementation-added barriers, and one for the operation performed in between them. Adding thread support may require up to n = 3t buffers, where t is the number of threads. Furthermore, the locks must then track not just the lock state of a buffer, but also the ID of the thread that locked it. This ID must be used when unlocking buffers so that only those used by the active thread are unlocked. However, since n = 3t buffers are only needed if all threads are participating in collective operations at the same time, the maximum number of buffers need not be allocated up front; more can instead be added dynamically as needed. Since synchronisation is already being performed internally within each collective, this approach has the potential to also benefit from “free” allocation in the sense of not requiring an additional barrier, provided sufficient memory remains.
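A thread-tagged lock table with on-demand growth might look like the sketch below. All names (`tagged_pool`, `tagged_acquire`, `tagged_release_prior`, `NO_OWNER`) are hypothetical. The sketch tracks only the lock table, uses `realloc` as a stand-in for whatever symmetric allocation the implementation would perform, and omits the mutual exclusion that concurrent threads would need around the table itself.

```c
#include <assert.h>
#include <stdlib.h>

#define NO_OWNER (-1)   /* marks a free buffer */

typedef struct {
    int *owner;     /* owner[i] is the thread ID holding buffer i, or NO_OWNER */
    size_t count;   /* number of buffers currently in the pool */
} tagged_pool;

/* Lock the first free buffer for thread `tid`; grow the pool by one buffer
 * if none is free. Returns the buffer index, or -1 if growth fails. */
static int tagged_acquire(tagged_pool *p, int tid)
{
    assert(tid >= 0);
    for (size_t i = 0; i < p->count; i++) {
        if (p->owner[i] == NO_OWNER) {
            p->owner[i] = tid;
            return (int)i;
        }
    }
    int *grown = realloc(p->owner, (p->count + 1) * sizeof *grown);
    if (grown == NULL)
        return -1;
    p->owner = grown;
    p->owner[p->count] = tid;
    return (int)p->count++;
}

/* Unlock only the buffers locked by `tid`, except `keep` (assumed to be the
 * buffer of the barrier that thread has just finished). */
static void tagged_release_prior(tagged_pool *p, int tid, int keep)
{
    for (size_t i = 0; i < p->count; i++) {
        if (p->owner[i] == tid && (int)i != keep)
            p->owner[i] = NO_OWNER;
    }
}
```

Because release is filtered by owner, a barrier completed by one thread leaves the buffers of every other thread locked, and the pool grows only when concurrent demand actually exceeds its current size.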
Further adding support for non-blocking collective operations requires n to become unbounded, as any number of operations may be started before previous ones have completed. As such, the pool of available buffers must be handled in a way that allocates more on demand when the pool is exhausted. The locked buffers must also be ordered by when their associated collective is waited on, such that a given synchronisation can only unlock buffers whose respective operations have already been waited on.
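One simple way to respect the constraint on non-blocking operations is to give each buffer a three-valued state instead of a binary lock, so that a synchronisation can distinguish operations that have been waited on from those still in flight. The sketch below is an assumption-laden illustration, not the paper's design: the names `nb_pool`, `nb_start`, `nb_wait` and `nb_sync_release` are hypothetical, `realloc` again stands in for symmetric allocation, and the buffer used by the synchronisation itself is not modelled.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { BUF_FREE, BUF_IN_FLIGHT, BUF_WAITED } buf_state;

typedef struct {
    buf_state *state;   /* state[i] describes buffer i */
    size_t count;       /* number of buffers currently in the pool */
} nb_pool;

/* Start a non-blocking collective: claim a free buffer, or grow the pool
 * when it is exhausted. Returns a handle (buffer index), or -1 on failure. */
static int nb_start(nb_pool *p)
{
    for (size_t i = 0; i < p->count; i++) {
        if (p->state[i] == BUF_FREE) {
            p->state[i] = BUF_IN_FLIGHT;
            return (int)i;
        }
    }
    buf_state *grown = realloc(p->state, (p->count + 1) * sizeof *grown);
    if (grown == NULL)
        return -1;
    p->state = grown;
    p->state[p->count] = BUF_IN_FLIGHT;
    return (int)p->count++;
}

/* Record that the operation behind handle `h` has been waited on. */
static void nb_wait(nb_pool *p, int h)
{
    assert(h >= 0 && (size_t)h < p->count);
    assert(p->state[h] == BUF_IN_FLIGHT);
    p->state[h] = BUF_WAITED;
}

/* After a synchronisation: unlock only buffers whose operations have been
 * waited on. Returns the number of buffers unlocked. */
static size_t nb_sync_release(nb_pool *p)
{
    size_t freed = 0;
    for (size_t i = 0; i < p->count; i++) {
        if (p->state[i] == BUF_WAITED) {
            p->state[i] = BUF_FREE;
            freed++;
        }
    }
    return freed;
}
```

In this sketch, buffers belonging to operations that have been started but not yet waited on survive any number of intervening synchronisations.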