`<cuda/std/bit>`

CUDA Performance Considerations

Given an unsigned integer with N bits and N <= 32, the <bit> functions translate into the following SASS instructions. For some functions, the results is decorated with a compile-time assumption to restrict its range and allowing further optimizations.

bit_width() translates into a single FLO SASS instruction. The result is assumed to be in the range [0, N].
bit_ceil() translates into ADD, FLO, SHL, IMINMAX SASS instructions. The result is assumed to be greater than or equal to the input.
bit_floor() translates into FLO, SHL SASS instructions. The result is assumed to be less than or equal to the input.
byteswap() translates into a single PRMT SASS instruction.
popcount() translates into a single POPC SASS instruction. The result is assumed to be in the range [0, N].
has_single_bit() translates into POPC + ISETP SASS instructions.
rotl()/rotr() translate into a single SHF (funned shift) SASS instruction.
countl_zero() translates into FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
countl_one() translates into LOP3, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
countr_zero() translates into BREV, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
countr_one() translates into LOP3, BREV, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].

Additional Notes

All functions are marked [[nodiscard]] and noexcept
All functions support 128-bit integer types
bit_ceil() checks for overflow in debug mode

Note

When the input values are run-time values that the compiler can resolve at compile-time, e.g. an index of a loop with a fixed number of iterations, using the functions could not be optimal.

Note

GCC <= 8 uses a slow path with more instructions even in CUDA

<cuda/std/bit>

CUDA Performance Considerations

Additional Notes

`<cuda/std/bit>`