cuda::device::memcpy_async_tx

Defined in header <cuda/barrier>:

template <typename T, size_t Alignment>
inline __device__
void cuda::device::memcpy_async_tx(
  T* dest,
  const T* src,
  cuda::aligned_size_t<Alignment> size,
  cuda::barrier<cuda::thread_scope_block>& bar);

Copies size bytes from global memory src to shared memory dest and decrements the transaction count of bar by size bytes.

Preconditions

src, dest are 16-byte aligned and size is a multiple of 16, i.e., Alignment >= 16.
dest points to a shared memory allocation that is at least size bytes wide.
src points to a global memory allocation that is at least size bytes wide.
bar is located in shared memory
If either destination or source is an invalid or null pointer, the behavior is undefined (even if count is zero).

Requires

is_trivially_copyable_v<T> is true.

Notes

This function can only be used under CUDA Compute Capability 9.0 (Hopper) or higher.

There is no feature flag to check if cuda::device::memcpy_async_tx is available.

Comparison to cuda::memcpy_async: memcpy_async_tx supports a subset of the operations of memcpy_async. It gives more control over the synchronization with a barrier than memcpy_async. Currently, memcpy_async_tx has no synchronous fallback mechanism., i.e., it currently does not work on older hardware (pre-CUDA Compute Capability 9.0, i.e., Hopper).

Example

#include <cuda/barrier>
#include <cuda/std/utility> // cuda::std::move

#if defined(__CUDA_MINIMUM_ARCH__) && __CUDA_MINIMUM_ARCH__ < 900
static_assert(false, "Insufficient CUDA Compute Capability: cuda::device::memcpy_async_tx is not available.");
#endif // __CUDA_MINIMUM_ARCH__

__device__ alignas(16) int gmem_x[2048];

__global__ void example_kernel() {
  alignas(16) __shared__ int smem_x[1024];
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;
  if (threadIdx.x == 0) {
    init(&bar, blockDim.x);
  }
  __syncthreads();

  barrier::arrival_token token;
  if (threadIdx.x == 0) {
    cuda::device::memcpy_async_tx(smem_x, gmem_x, cuda::aligned_size_t<16>(sizeof(smem_x)), bar);
    token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(smem_x));
  } else {
    token = bar.arrive(1);
  }
  bar.wait(cuda::std::move(token));

  // smem_x contains the contents of gmem_x[0], ..., gmem_x[1023]
  smem_x[threadIdx.x] += 1;
}

See it on Godbolt