Math function overloading
Kokkos math functions for half-precision types
must use the specialized FP16 hardware units if available (rather than the FP32 ones)
to achieve peak performance.
Therefore, traditional output-correctness tests for the half-precision types are
not sufficient: a numerically correct result may still have been computed on the FP32 path.
Kokkos relies on a complex, macro-heavy implementation strategy
to provide math function overload resolution.
For half-precision types, Kokkos provides a
templated fallback
that casts to float, and the CUDA backend provides a
non-template overload
that uses the CUDA intrinsics.
According to the
best viable function rules,
the non-template overload is selected.
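To make the selection mechanism concrete, here is a minimal, self-contained sketch (not the actual Kokkos source; the type half_t and the function fmax_impl are stand-ins) showing why the non-template overload wins over the templated fallback:

    #include <cmath>
    #include <cstdio>

    struct half_t { float value; };  // stand-in for Kokkos::Experimental::half_t

    // Templated fallback: casts to float and defers to the FP32 math function.
    template <class T>
    T fmax_impl(T a, T b) {
        std::printf("templated fallback (FP32 path)\n");
        return T{std::fmax(static_cast<float>(a.value), static_cast<float>(b.value))};
    }

    // Non-template overload, analogous to what a CUDA backend provides on top of __hmax.
    half_t fmax_impl(half_t a, half_t b) {
        std::printf("non-template overload (FP16 path)\n");
        return a.value > b.value ? a : b;
    }

    int main() {
        // Both candidates are viable; the non-template overload is the best
        // viable function and is selected.
        fmax_impl(half_t{1.0f}, half_t{2.0f});
    }

The same rule applies when the fallback template and the backend overload come from different headers, which is why the intended path can silently change if an overload is removed or its signature drifts.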
Manually inspecting the source code to verify that the correct implementation path is preserved
as the supported architectures and the code base evolve is challenging and requires substantial review effort.
TestMax shows how implementation correctness can be confirmed through automated SASS analysis for Kokkos::fmax,
for which the CUDA intrinsic __hmax exists.
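For context, the kind of kernel whose SASS the tests below inspect looks roughly like the following (a minimal sketch assuming a CUDA-enabled Kokkos build; the view names, sizes, and kernel label are illustrative, and this is not the actual FunctorMax used by the test):

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
        Kokkos::ScopeGuard guard(argc, argv);
        using half_t = Kokkos::Experimental::half_t;

        const int n = 1024;
        Kokkos::View<half_t*, Kokkos::Cuda> a("a", n);
        Kokkos::View<half_t*, Kokkos::Cuda> b("b", n);
        Kokkos::View<half_t*, Kokkos::Cuda> result("result", n);

        Kokkos::parallel_for(
            "half_fmax", Kokkos::RangePolicy<Kokkos::Cuda>(0, n),
            KOKKOS_LAMBDA(const int i) {
                // Should resolve to the non-template CUDA overload and lower to
                // the FP16 HMNMX2 instruction on compute capability >= 8.0.
                result(i) = Kokkos::fmax(a(i), b(i));
            });

        Kokkos::fence();
        return 0;
    }

The test builds one such kernel per Method value (see the reference below) and checks the SASS emitted for each.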
References:
- class examples.kokkos.half.example_math.Method(*values)
  Bases: IntEnum
  - CUDA_HMAX = 0
    Use the CUDA intrinsic __hmax.
  - FMAX = 1
    Use fmax(float, float).
  - KOKKOS_FMAX = 2
    Use Kokkos::fmax(Kokkos::Experimental::half_t, Kokkos::Experimental::half_t).
- class examples.kokkos.half.example_math.TestMax
  Bases: CMakeAwareTestCase
  - SIGNATURE_TEMPLATE: Final[str] = 'void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<reprospect::examples::kokkos::half::FunctorMax<\\(reprospect::examples::kokkos::half::Method\\){method}, Kokkos::View<Kokkos::Experimental::Impl::floating_point_wrapper<__half>\\s*\\*, Kokkos::Cuda>>, Kokkos::RangePolicy<>, Kokkos::Cuda>>'
  - property cubin: Path
  - cuobjdump() -> CuObjDump
  - classmethod get_target_name() -> str
  - match_block_and_loads(*, instructions: Sequence[Instruction]) -> tuple[BasicBlock, int, InstructionMatch, InstructionMatch]
    Find the block with two 16-bit loads, such as:
        LDG.E.U16 R2, [R2.64]
        LDG.E.U16 R5, [R4.64]
  - match_fp16(instructions: Sequence[Instruction]) -> None
    Typically:
        LDG.E.U16 R2, desc[UR6][R2.64]
        LDG.E.U16 R5, desc[UR6][R4.64]
        ...
        HMNMX2 R5, R2.H0_H0, R5.H0_H0, !PT
        ...
        STG.E.U16 desc[UR6][R6.64], R5
  - match_fp16_to_fp32(*, src_a: str, src_b: str, instructions: Sequence[Instruction]) -> tuple[int, InstructionMatch, InstructionMatch]
    Conversion from FP16 to FP32.
  - match_fp32(instructions: Sequence[Instruction]) -> None
    Typically:
        LDG.E.U16 R2, desc[UR6][R2.64]
        LDG.E.U16 R4, desc[UR6][R4.64]
        ...
        HADD2.F32 R6, -RZ, R2.H0_H0
        HADD2.F32 R7, -RZ, R4.H0_H0
        FMNMX R6, R6, R7, !PT
        F2FP.F16.F32.PACK_AB R3, RZ, R6
        ...
        STG.E.U16 desc[UR6][R6.64], R3
  - match_fp32_to_fp16(*, src: str, instructions: Sequence[Instruction]) -> tuple[int, InstructionMatch]
    Conversion from FP32 to FP16.
  - match_store(src: str, instructions: Sequence[Instruction]) -> None
  - test_cuda_hmax(decoder: dict[Method, Decoder]) -> None
    Check SASS code for Method.CUDA_HMAX.
    Note: before compute capability 8.0, the intrinsic __hmax generates FP32 instructions.
  - test_fmax(decoder: dict[Method, Decoder]) -> None
    Check SASS code for Method.FMAX.
  - test_kokkos_fmax(decoder: dict[Method, Decoder]) -> None
    Check SASS code for Method.KOKKOS_FMAX.
    Note: it always produces exactly the same SASS code as Method.CUDA_HMAX, confirming that Kokkos::fmax is implemented correctly for half-precision types.