<cuda/std/bit>
CUDA Performance Considerations
Given an unsigned integer with N bits and N <= 32, the <bit> functions translate into the following SASS instructions. For some functions, the result is annotated with a compile-time assumption that restricts its range and allows further optimizations.
- bit_width() translates into a single FLO SASS instruction. The result is assumed to be in the range [0, N].
- bit_ceil() translates into ADD, FLO, SHL, IMINMAX SASS instructions. The result is assumed to be greater than or equal to the input.
- bit_floor() translates into FLO, SHL SASS instructions. The result is assumed to be less than or equal to the input.
- byteswap() translates into a single PRMT SASS instruction.
- popcount() translates into a single POPC SASS instruction. The result is assumed to be in the range [0, N].
- has_single_bit() translates into POPC, ISETP SASS instructions.
- rotl()/rotr() translate into a single SHF (funnel shift) SASS instruction.
- countl_zero() translates into FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
- countl_one() translates into LOP3, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
- countr_zero() translates into BREV, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
- countr_one() translates into LOP3, BREV, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
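The instruction sequences listed above follow from a handful of arithmetic identities. The sketch below is a portable host-side C++ model of those identities for N = 32, not the libcu++ implementation. The sketch namespace and the helpers find_leading_one and bit_reverse are hypothetical stand-ins for what a find-leading-one and a bit-reverse operation compute, and which line corresponds to which SASS instruction is an assumption.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Position of the most significant set bit, or -1 when x == 0.
// Hypothetical model of a "find leading one" operation.
inline int find_leading_one(std::uint32_t x) {
    int pos = -1;
    while (x != 0) {
        x >>= 1;
        ++pos;
    }
    return pos;
}

// Reverse the order of the 32 bits (hypothetical model of a bit reverse).
inline std::uint32_t bit_reverse(std::uint32_t x) {
    std::uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

// Number of bits needed to represent x; 0 for x == 0.
inline int bit_width(std::uint32_t x) { return find_leading_one(x) + 1; }

// Leading zeros; 32 for x == 0.
inline int countl_zero(std::uint32_t x) { return 31 - find_leading_one(x); }

// Trailing zeros: reverse the bits, then count leading zeros.
inline int countr_zero(std::uint32_t x) { return countl_zero(bit_reverse(x)); }

// Counting ones: invert the input, then count zeros.
inline int countl_one(std::uint32_t x) { return countl_zero(~x); }
inline int countr_one(std::uint32_t x) { return countr_zero(~x); }

// Largest power of two <= x; 0 for x == 0.
inline std::uint32_t bit_floor(std::uint32_t x) {
    return x == 0 ? 0u : (std::uint32_t{1} << find_leading_one(x));
}

// Smallest power of two >= x. Precondition: x <= 2^31.
inline std::uint32_t bit_ceil(std::uint32_t x) {
    return x <= 1 ? 1u : (std::uint32_t{1} << bit_width(x - 1));
}

// Small usage example; the values also respect the documented range assumptions.
inline void self_check() {
    assert(bit_width(5u) == 3);
    assert(countl_zero(0u) == 32);
    assert(countr_zero(8u) == 3);
    assert(bit_floor(5u) == 4u && bit_floor(5u) <= 5u);
    assert(bit_ceil(5u) == 8u && bit_ceil(5u) >= 5u);
}

}  // namespace sketch
```

The zero-input cases (for example countl_zero(0) == 32) rely on find_leading_one returning -1 in this model; how the hardware instructions handle a zero input is not modelled here.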
Additional Notes
- All functions are marked [[nodiscard]] and noexcept.
- All functions support 128-bit integer types.
- bit_ceil() checks for overflow in debug mode.
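For an N-bit unsigned type, the result of bit_ceil() is representable only when the input is at most 2^(N-1). The following is a minimal host-side sketch of that precondition for N = 32. The names bit_ceil_overflows and checked_bit_ceil are hypothetical helpers, not libcu++ API, and the way the library actually reports the failure in debug mode is not shown.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch_ceil {

// True when the smallest power of two >= x does not fit in 32 bits.
inline bool bit_ceil_overflows(std::uint32_t x) {
    return x > (std::uint32_t{1} << 31);
}

// On success, writes the smallest power of two >= x to out and returns true.
// On overflow, returns false and leaves out untouched.
inline bool checked_bit_ceil(std::uint32_t x, std::uint32_t& out) {
    if (bit_ceil_overflows(x)) {
        return false;
    }
    std::uint32_t p = 1;
    while (p < x) {
        p <<= 1;  // cannot overflow: x <= 2^31 was checked above
    }
    out = p;
    return true;
}

// Small usage example.
inline void example() {
    std::uint32_t r = 0;
    assert(checked_bit_ceil(5u, r) && r == 8u);
    assert(!checked_bit_ceil(0x80000001u, r));  // 2^32 is not representable
}

}  // namespace sketch_ceil
```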
Note
When the inputs are nominally run-time values that the compiler can nonetheless resolve at compile time, e.g. the index of a loop with a fixed number of iterations, using these functions may not be optimal.
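One possible way to sidestep this, assuming the inputs really are known at compile time, is to force the computation into a constant expression so that no instruction sequence is needed at all. The sketch below contrasts the two styles in portable host-side C++. The functions bit_width_ct, sum_widths_ct, sum_widths_runtime and sum_widths_constant are hypothetical illustrations, not libcu++ API, and whether this pays off in a given kernel is something to measure rather than assume.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch_ct {

// C++11-style constexpr bit width: 0 for x == 0, else floor(log2(x)) + 1.
constexpr int bit_width_ct(std::uint32_t x) {
    return x == 0 ? 0 : 1 + bit_width_ct(x >> 1);
}

// Run-time style: i is an ordinary variable, even though the trip count is fixed.
inline int sum_widths_runtime() {
    int sum = 0;
    for (std::uint32_t i = 0; i < 8; ++i) {
        sum += bit_width_ct(i);
    }
    return sum;
}

// Compile-time style: sum of bit widths over the half-open range [i, end).
constexpr int sum_widths_ct(std::uint32_t i, std::uint32_t end) {
    return i == end ? 0 : bit_width_ct(i) + sum_widths_ct(i + 1, end);
}

// The constexpr variable forces the whole sum to be evaluated by the compiler.
inline int sum_widths_constant() {
    constexpr int sum = sum_widths_ct(0, 8);
    return sum;
}

// Widths of 0..7 are 0, 1, 2, 2, 3, 3, 3, 3.
static_assert(sum_widths_ct(0, 8) == 17, "unexpected sum of bit widths");

// Small usage example: both styles agree on the result.
inline void example() {
    assert(sum_widths_runtime() == sum_widths_constant());
}

}  // namespace sketch_ct
```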
Note
GCC <= 8 uses a slow path with more instructions even in CUDA