Integration tests

Tests that exercise many features, but do not seek to exemplify use cases. For use cases, see Examples.

Half types

Analyze two kernels that compute the square of half-precision values with and without packing using:

class tests.integration.test_half.TestNCUView on GitHub 

ncu-based analysis of the individual vs packed implementation.

METRICS: Final[tuple[Metric | MetricCorrelation | MetricDeviceAttribute, ...]] = (MetricDeviceAttribute(name='display_name'), MetricCounter(name='smsp__sass_inst_executed_op_global_ld', pretty_name='L1/TEX cache global load instructions sass', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_requests_pipe_lsu_mem_global_op_ld', pretty_name='L1/TEX cache global load requests', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld', pretty_name='L1/TEX cache global load sectors', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), Metric(name='launch__grid_dim_x', pretty_name='launch__grid_dim_x', subs=None), Metric(name='launch__grid_dim_y', pretty_name='launch__grid_dim_y', subs=None), Metric(name='launch__grid_dim_z', pretty_name='launch__grid_dim_z', subs=None), Metric(name='launch__block_dim_x', pretty_name='launch__block_dim_x', subs=None), Metric(name='launch__block_dim_y', pretty_name='launch__block_dim_y', subs=None), Metric(name='launch__block_dim_z', pretty_name='launch__block_dim_z', subs=None))

pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]

test_memory(results: ProfilingResults) → NoneView on GitHub : Compare the memory traffic.

class tests.integration.test_half.TestSASSView on GitHub 

Tests that combine different half-precision SASS instructions.

FILE: Final[Path] = PosixPath('/__w/reprospect/reprospect/tests/assets/half.cu')

cuobjdump(workdir: Path, parameters: Parameters, cmake_file_api: FileAPI) → CuObjDumpView on GitHub 

pytestmark = [Mark(name='parametrize', args=('parameters', (Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.VOLTA: 'VOLTA'>, compute_capability=ComputeCapability(major=7, minor=0))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.TURING: 'TURING'>, compute_capability=ComputeCapability(major=7, minor=5))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.AMPERE: 'AMPERE'>, compute_capability=ComputeCapability(major=8, minor=0))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.AMPERE: 'AMPERE'>, compute_capability=ComputeCapability(major=8, minor=6))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.ADA: 'ADA'>, compute_capability=ComputeCapability(major=8, minor=9))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.HOPPER: 'HOPPER'>, compute_capability=ComputeCapability(major=9, minor=0))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.BLACKWELL: 'BLACKWELL'>, compute_capability=ComputeCapability(major=10, minor=0))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.BLACKWELL: 'BLACKWELL'>, compute_capability=ComputeCapability(major=12, minor=0))))), kwargs={'ids': <class 'str'>, 'scope': 'class'})]

test_individual(parameters: Parameters, cuobjdump: CuObjDump) → NoneView on GitHub 

Analyse the individual implementation.

It loads only 16 bits at once and does a “broadcast” of the lower lane (H0_H0) because HMUL2 works with 2 lanes (packed instruction).

Typically:

LDG.E.U16.CONSTANT.SYS R2, [R2]
HMUL2 R0, R2.H0_H0, R2.H0_H0
STG.E.U16.SYS [R4], R0

test_packed(parameters: Parameters, cuobjdump: CuObjDump) → NoneView on GitHub 

Analyse the packed implementation.

First, there is a block that performs the “odd” element and therefore looks like the individual implementation:

LDG.E.U16.CONSTANT.SYS R2, [R2]
HMUL2 R0, R2.H0_H0, R2.H0_H0

Then, there is another block that performs the packed multiplication. It loads 32 bits at once. Typically:

LDG.E.CONSTANT.SYS R2, [R2]
HMUL2 R7, R2, R2

Note that, even though the PTX always reads:

mul.f16x2 %r8,%r9,%r9

ptxas may choose to use HFMA2:

HFMA2 R7, R2, R2, -RZ

or even HFMA2.MMA:

HFMA2.MMA R7, R2, R2, -RZ

instead of HMUL2, depending on the targeted architecture.