Integration tests

Tests that exercise many features, but do not seek to exemplify use cases. For use cases, see Examples.

Half types

Analyze two kernels that compute the square of half-precision values with and without packing using:

class tests.integration.test_half.TestNCUView on GitHub

Bases: object

ncu-based analysis of the individual vs packed implementation.

BLOCK_DIM_X: Final[dict[str, int]] = {'individual': 129, 'packed': 65}
HALF: Final[Path] = PosixPath('tests/assets/tests_assets_half')
METRICS: Final[tuple[Metric | MetricCorrelation | MetricDeviceAttribute, ...]] = (MetricDeviceAttribute(name='display_name'), MetricCounter(name='smsp__sass_inst_executed_op_global_ld', pretty_name='L1/TEX cache global load instructions sass', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_requests_pipe_lsu_mem_global_op_ld', pretty_name='L1/TEX cache global load requests', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld', pretty_name='L1/TEX cache global load sectors', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), Metric(name='launch__grid_dim_x', pretty_name='launch__grid_dim_x', subs=None), Metric(name='launch__grid_dim_y', pretty_name='launch__grid_dim_y', subs=None), Metric(name='launch__grid_dim_z', pretty_name='launch__grid_dim_z', subs=None), Metric(name='launch__block_dim_x', pretty_name='launch__block_dim_x', subs=None), Metric(name='launch__block_dim_y', pretty_name='launch__block_dim_y', subs=None), Metric(name='launch__block_dim_z', pretty_name='launch__block_dim_z', subs=None))
SIZE: Final[int] = 129

Buffer size.

SIZEOF: Final[int] = 2

Size of __half in bytes.

WARP_SIZE: Final[int] = 32
pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
results(workdir: Path, bindir: Path) ProfilingResultsView on GitHub
test_memory(results: ProfilingResults) NoneView on GitHub

Compare the memory traffic.

class tests.integration.test_half.TestSASSView on GitHub

Bases: object

Tests that combine different half-precision SASS instructions.

FILE: Final[Path] = PosixPath('/__w/reprospect/reprospect/tests/assets/half.cu')
cuobjdump(workdir: Path, parameters: Parameters, cmake_file_api: FileAPI) CuObjDumpView on GitHub
pytestmark = [Mark(name='parametrize', args=('parameters', (Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.VOLTA: 'VOLTA'>, compute_capability=ComputeCapability(major=7, minor=0))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.TURING: 'TURING'>, compute_capability=ComputeCapability(major=7, minor=5))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.AMPERE: 'AMPERE'>, compute_capability=ComputeCapability(major=8, minor=0))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.AMPERE: 'AMPERE'>, compute_capability=ComputeCapability(major=8, minor=6))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.ADA: 'ADA'>, compute_capability=ComputeCapability(major=8, minor=9))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.HOPPER: 'HOPPER'>, compute_capability=ComputeCapability(major=9, minor=0))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.BLACKWELL: 'BLACKWELL'>, compute_capability=ComputeCapability(major=10, minor=0))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.BLACKWELL: 'BLACKWELL'>, compute_capability=ComputeCapability(major=12, minor=0))))), kwargs={'ids': <class 'str'>, 'scope': 'class'})]
test_individual(parameters: Parameters, cuobjdump: CuObjDump) NoneView on GitHub

Analyse the individual implementation.

It loads only 16 bits at once and does a “broadcast” of the lower lane (H0_H0) because HMUL2 works with 2 lanes (packed instruction).

Typically:

LDG.E.U16.CONSTANT.SYS R2, [R2]
HMUL2 R0, R2.H0_H0, R2.H0_H0
STG.E.U16.SYS [R4], R0
test_packed(parameters: Parameters, cuobjdump: CuObjDump) NoneView on GitHub

Analyse the packed implementation.

First, there is a block that performs the “odd” element and therefore looks like the individual implementation:

LDG.E.U16.CONSTANT.SYS R2, [R2]
HMUL2 R0, R2.H0_H0, R2.H0_H0

Then, there is another block that performs the packed multiplication. It loads 32 bits at once. Typically:

LDG.E.CONSTANT.SYS R2, [R2]
HMUL2 R7, R2, R2

Note that, even though the PTX always reads:

mul.f16x2 %r8,%r9,%r9

ptxas may choose to use HFMA2:

HFMA2 R7, R2, R2, -RZ

or even HFMA2.MMA:

HFMA2.MMA R7, R2, R2, -RZ

instead of HMUL2, depending on the targeted architecture.