Math function overloading

Kokkos math functions for half-precision types must use the specialized FP16 hardware units where available (rather than the FP32 units) to achieve peak performance. Because the FP16 path and the FP32 fallback typically produce the same results, traditional output-correctness tests cannot distinguish between them and are therefore not sufficient for the half-precision types.

Kokkos relies on an intricate, macro-heavy implementation strategy to provide math function overload resolution. For half-precision types, Kokkos provides a templated fallback that casts to float, while the CUDA backend provides a non-template overload that uses the CUDA intrinsics. Under the C++ rules for choosing the best viable function, a non-template function is preferred over a template, so the intrinsic-based overload is selected.

Manually inspecting the source code to verify that the correct implementation path is preserved as the supported architectures and the code base evolve is challenging and requires substantial review effort.

TestMax shows how implementation correctness can be confirmed through automated SASS analysis for Kokkos::fmax, for which the CUDA intrinsic __hmax exists.
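
The core idea can be sketched in a few lines of Python. This is a simplified illustration, not the project's actual API: the helper name `uses_fp16_hardware` is made up, and the sample SASS strings are taken from the instruction listings documented below.

```python
import re

# Hypothetical helper (not part of the test suite's real API): given the SASS
# listing for a kernel, decide whether the native FP16 path was taken. On
# compute capability >= 8.0, a half-precision max compiles to HMNMX2, whereas
# the FP32 fallback shows up as FMNMX plus conversion instructions.
def uses_fp16_hardware(sass: str) -> bool:
    return re.search(r"\bHMNMX2\b", sass) is not None

fp16_sass = "LDG.E.U16 R2, [R2.64]\nHMNMX2 R5, R2.H0_H0, R5.H0_H0, !PT"
fp32_sass = "HADD2.F32 R6, -RZ, R2.H0_H0\nFMNMX R6, R6, R7, !PT"
print(uses_fp16_hardware(fp16_sass), uses_fp16_hardware(fp32_sass))  # True False
```

In the real tests, the SASS listing is obtained via cuobjdump and decoded into structured instructions before matching, rather than searched as raw text.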

References:

class examples.kokkos.half.example_math.Method(*values)

Bases: IntEnum

CUDA_HMAX = 0

Use the CUDA intrinsic __hmax.

FMAX = 1

Use fmax(float, float).

KOKKOS_FMAX = 2

Use Kokkos::fmax(Kokkos::Experimental::half_t, Kokkos::Experimental::half_t).

class examples.kokkos.half.example_math.TestMax

Bases: CMakeAwareTestCase

SIGNATURE_TEMPLATE: Final[str] = 'void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<reprospect::examples::kokkos::half::FunctorMax<\\(reprospect::examples::kokkos::half::Method\\){method}, Kokkos::View<Kokkos::Experimental::Impl::floating_point_wrapper<__half>\\s*\\*, Kokkos::Cuda>>, Kokkos::RangePolicy<>, Kokkos::Cuda>>'
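
Since the kernel signature embeds the Method enumerator as a cast integer, one template covers all three variants. A sketch of how such a template can be specialized per enum value and compiled (illustrative only; the enum stand-in and the shortened template below are not the real definitions):

```python
import re
from enum import IntEnum

# Stand-in mirroring the documented Method values.
class Method(IntEnum):
    CUDA_HMAX = 0
    FMAX = 1
    KOKKOS_FMAX = 2

# Shortened stand-in for SIGNATURE_TEMPLATE: the {method} placeholder is
# filled with the enumerator's integer value before compiling the regex.
TEMPLATE = r"FunctorMax<\(reprospect::examples::kokkos::half::Method\){method},"

patterns = {m: re.compile(TEMPLATE.format(method=m.value)) for m in Method}
demangled = "FunctorMax<(reprospect::examples::kokkos::half::Method)2, ..."
matching = [m.name for m in Method if patterns[m].search(demangled)]
print(matching)  # ['KOKKOS_FMAX']
```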
property cubin: Path
cuobjdump() → CuObjDump
decoder(cuobjdump: CuObjDump) → dict[Method, Decoder]
classmethod get_target_name() → str
match_block_and_loads(*, instructions: Sequence[Instruction]) → tuple[BasicBlock, int, InstructionMatch, InstructionMatch]

Find the block with two 16-bit loads, such as:

LDG.E.U16 R2, [R2.64]
LDG.E.U16 R5, [R4.64]
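
The matching idea can be sketched as follows, assuming each instruction is available as a plain text line (the real method operates on decoded Instruction objects and basic-block structure; the helper name is hypothetical):

```python
import re

# Match a 16-bit global load and capture its destination register.
LOAD_U16 = re.compile(r"LDG\.E\.U16\s+(R\d+),")

def find_two_u16_loads(lines):
    """Return the destination registers of the first two 16-bit global loads."""
    regs = [m.group(1) for line in lines if (m := LOAD_U16.search(line))]
    return regs[:2]

block = ["LDG.E.U16 R2, [R2.64]", "IADD3 R4, R2, 0x4, RZ", "LDG.E.U16 R5, [R4.64]"]
print(find_two_u16_loads(block))  # ['R2', 'R5']
```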
match_fp16(instructions: Sequence[Instruction]) → None

Typically:

LDG.E.U16 R2, desc[UR6][R2.64]
LDG.E.U16 R5, desc[UR6][R4.64]
...
HMNMX2 R5, R2.H0_H0, R5.H0_H0, !PT
...
STG.E.U16 desc[UR6][R6.64], R5
match_fp16_to_fp32(*, src_a: str, src_b: str, instructions: Sequence[Instruction]) → tuple[int, InstructionMatch, InstructionMatch]

Conversion from FP16 to FP32.

match_fp32(instructions: Sequence[Instruction]) → None

Typically:

LDG.E.U16 R2, desc[UR6][R2.64]
LDG.E.U16 R4, desc[UR6][R4.64]
...
HADD2.F32 R6, -RZ, R2.H0_H0
HADD2.F32 R7, -RZ, R4.H0_H0
FMNMX R6, R6, R7, !PT
F2FP.F16.F32.PACK_AB R3, RZ, R6
...
STG.E.U16 desc[UR6][R6.64], R3
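
The fallback path is recognizable as an ordered sequence: loads, widening of each half to float (here via HADD2.F32 with -RZ), an FP32 FMNMX, a pack back to FP16 (F2FP), and a store. A simplified ordered matcher, with a made-up helper name and the listing above as sample data (not the real implementation):

```python
def matches_in_order(lines, opcodes):
    """Check that each opcode appears, in order, somewhere in the listing."""
    it = iter(lines)  # shared iterator enforces ordering across opcodes
    return all(any(op in line for line in it) for op in opcodes)

fp32_path = [
    "LDG.E.U16 R2, desc[UR6][R2.64]",
    "LDG.E.U16 R4, desc[UR6][R4.64]",
    "HADD2.F32 R6, -RZ, R2.H0_H0",
    "HADD2.F32 R7, -RZ, R4.H0_H0",
    "FMNMX R6, R6, R7, !PT",
    "F2FP.F16.F32.PACK_AB R3, RZ, R6",
    "STG.E.U16 desc[UR6][R6.64], R3",
]
ok = matches_in_order(fp32_path, ["LDG.E.U16", "HADD2.F32", "FMNMX", "F2FP", "STG.E.U16"])
print(ok)  # True
```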
match_fp32_to_fp16(*, src: str, instructions: Sequence[Instruction]) → tuple[int, InstructionMatch]

Conversion from FP32 to FP16.

match_store(src: str, instructions: Sequence[Instruction]) → None
test_cuda_hmax(decoder: dict[Method, Decoder]) → None

Check SASS code for Method.CUDA_HMAX.

Note

Before compute capability 8.0, the intrinsic __hmax generates FP32 instructions.

test_fmax(decoder: dict[Method, Decoder]) → None

Check SASS code for Method.FMAX.

test_kokkos_fmax(decoder: dict[Method, Decoder]) → None

Check SASS code for Method.KOKKOS_FMAX.

Note

Method.KOKKOS_FMAX always produces exactly the same SASS code as Method.CUDA_HMAX, confirming that Kokkos::fmax selects the intrinsic-based implementation path.