Math function overloading

Kokkos math functions for half-precision types must use the specialized FP16 hardware units where available (rather than the FP32 units) to achieve peak performance. Because the FP16 path and the FP32 fallback typically produce the same results, traditional output-correctness tests cannot distinguish between them and are therefore not sufficient for the half-precision types.

Kokkos relies on an intricate, macro-heavy implementation strategy to provide math function overload resolution. For half-precision types, Kokkos provides a templated fallback that casts to float, while the CUDA backend provides a non-template overload that uses the CUDA intrinsics. Under the C++ rules for choosing the best viable function, a non-template function is preferred over a template, so the intrinsic-based overload is selected.

Manually inspecting the source code to verify that the correct implementation path is preserved as the supported architectures and the code base evolve is challenging and requires substantial review effort.

TestMax shows how implementation correctness can be confirmed through automated SASS analysis for Kokkos::fmax, for which the CUDA intrinsic __hmax exists.
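
The core idea can be sketched in a few lines of Python. This is a simplified illustration, not the project's actual API: the helper name `uses_fp16_hardware` is made up, and the sample SASS strings are taken from the instruction listings documented below.

```python
import re

# Hypothetical helper (not part of the test suite's real API): given the SASS
# listing for a kernel, decide whether the native FP16 path was taken. On
# compute capability >= 8.0, a half-precision max compiles to HMNMX2, whereas
# the FP32 fallback shows up as FMNMX plus conversion instructions.
def uses_fp16_hardware(sass: str) -> bool:
    return re.search(r"\bHMNMX2\b", sass) is not None

fp16_sass = "LDG.E.U16 R2, [R2.64]\nHMNMX2 R5, R2.H0_H0, R5.H0_H0, !PT"
fp32_sass = "HADD2.F32 R6, -RZ, R2.H0_H0\nFMNMX R6, R6, R7, !PT"
print(uses_fp16_hardware(fp16_sass), uses_fp16_hardware(fp32_sass))  # True False
```

In the real tests, the SASS listing is obtained via cuobjdump and decoded into structured instructions before matching, rather than searched as raw text.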

References:

class examples.kokkos.half.example_math.Method(*values)

Bases: IntEnum

CUDA_HMAX = 0

Use the CUDA intrinsic __hmax.

FMAX = 1

Use fmax(float, float).

KOKKOS_FMAX = 2

Use Kokkos::fmax(Kokkos::Experimental::half_t, Kokkos::Experimental::half_t).

class examples.kokkos.half.example_math.TestMax

Bases: CMakeAwareTestCase

SIGNATURE_TEMPLATE: Final[str] = 'void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<reprospect::examples::kokkos::half::FunctorMax<\\(reprospect::examples::kokkos::half::Method\\){method}, Kokkos::View<Kokkos::Experimental::Impl::floating_point_wrapper<__half>\\s*\\*, Kokkos::Cuda>>, Kokkos::RangePolicy<>, Kokkos::Cuda>>'
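
Since the kernel signature embeds the Method enumerator as a cast integer, one template covers all three variants. A sketch of how such a template can be specialized per enum value and compiled (illustrative only; the enum stand-in and the shortened template below are not the real definitions):

```python
import re
from enum import IntEnum

# Stand-in mirroring the documented Method values.
class Method(IntEnum):
    CUDA_HMAX = 0
    FMAX = 1
    KOKKOS_FMAX = 2

# Shortened stand-in for SIGNATURE_TEMPLATE: the {method} placeholder is
# filled with the enumerator's integer value before compiling the regex.
TEMPLATE = r"FunctorMax<\(reprospect::examples::kokkos::half::Method\){method},"

patterns = {m: re.compile(TEMPLATE.format(method=m.value)) for m in Method}
demangled = "FunctorMax<(reprospect::examples::kokkos::half::Method)2, ..."
matching = [m.name for m in Method if patterns[m].search(demangled)]
print(matching)  # ['KOKKOS_FMAX']
```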
property cubin: Path
cuobjdump() → CuObjDump
decoder(cuobjdump: CuObjDump) → dict[Method, Decoder]
classmethod get_target_name() → str
match_block_and_loads(*, instructions: Sequence[Instruction]) → tuple[BasicBlock, int, InstructionMatch, InstructionMatch]

Find the block with two 16-bit loads, such as:

LDG.E.U16 R2, [R2.64]
LDG.E.U16 R5, [R4.64]
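
The matching idea can be sketched as follows, assuming each instruction is available as a plain text line (the real method operates on decoded Instruction objects and basic-block structure; the helper name is hypothetical):

```python
import re

# Match a 16-bit global load and capture its destination register.
LOAD_U16 = re.compile(r"LDG\.E\.U16\s+(R\d+),")

def find_two_u16_loads(lines):
    """Return the destination registers of the first two 16-bit global loads."""
    regs = [m.group(1) for line in lines if (m := LOAD_U16.search(line))]
    return regs[:2]

block = ["LDG.E.U16 R2, [R2.64]", "IADD3 R4, R2, 0x4, RZ", "LDG.E.U16 R5, [R4.64]"]
print(find_two_u16_loads(block))  # ['R2', 'R5']
```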
match_fp16(instructions: Sequence[Instruction]) → None

Typically:

LDG.E.U16 R2, desc[UR6][R2.64]
LDG.E.U16 R5, desc[UR6][R4.64]
...
HMNMX2 R5, R2.H0_H0, R5.H0_H0, !PT
...
STG.E.U16 desc[UR6][R6.64], R5
match_fp16_to_fp32(*, src_a: str, src_b: str, instructions: Sequence[Instruction]) → tuple[int, InstructionMatch, InstructionMatch]

Conversion from FP16 to FP32.

match_fp32(instructions: Sequence[Instruction]) → None

Typically:

LDG.E.U16 R2, desc[UR6][R2.64]
LDG.E.U16 R4, desc[UR6][R4.64]
...
HADD2.F32 R6, -RZ, R2.H0_H0
HADD2.F32 R7, -RZ, R4.H0_H0
FMNMX R6, R6, R7, !PT
F2FP.F16.F32.PACK_AB R3, RZ, R6
...
STG.E.U16 desc[UR6][R6.64], R3
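
The fallback path is recognizable as an ordered sequence: loads, widening of each half to float (here via HADD2.F32 with -RZ), an FP32 FMNMX, a pack back to FP16 (F2FP), and a store. A simplified ordered matcher, with a made-up helper name and the listing above as sample data (not the real implementation):

```python
def matches_in_order(lines, opcodes):
    """Check that each opcode appears, in order, somewhere in the listing."""
    it = iter(lines)  # shared iterator enforces ordering across opcodes
    return all(any(op in line for line in it) for op in opcodes)

fp32_path = [
    "LDG.E.U16 R2, desc[UR6][R2.64]",
    "LDG.E.U16 R4, desc[UR6][R4.64]",
    "HADD2.F32 R6, -RZ, R2.H0_H0",
    "HADD2.F32 R7, -RZ, R4.H0_H0",
    "FMNMX R6, R6, R7, !PT",
    "F2FP.F16.F32.PACK_AB R3, RZ, R6",
    "STG.E.U16 desc[UR6][R6.64], R3",
]
ok = matches_in_order(fp32_path, ["LDG.E.U16", "HADD2.F32", "FMNMX", "F2FP", "STG.E.U16"])
print(ok)  # True
```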
match_fp32_to_fp16(*, src: str, instructions: Sequence[Instruction]) → tuple[int, InstructionMatch]

Conversion from FP32 to FP16.

match_store(src: str, instructions: Sequence[Instruction]) → None
test_cuda_hmax(decoder: dict[Method, Decoder]) → None

Check SASS code for Method.CUDA_HMAX.

Note

Before compute capability 8.0, the intrinsic __hmax generates FP32 instructions.

test_fmax(decoder: dict[Method, Decoder]) → None

Check SASS code for Method.FMAX.

test_kokkos_fmax(decoder: dict[Method, Decoder]) → None

Check SASS code for Method.KOKKOS_FMAX.

Note

Method.KOKKOS_FMAX always produces exactly the same SASS code as Method.CUDA_HMAX, confirming that Kokkos::fmax selects the intrinsic-based implementation path.